The purpose of the research was to establish an
understanding of the electroglottograph (EGG) signal and its
relationship  to  vocal  fold  vibration.  The  method  adopted  is 
to  compare  the  EGG  with  data  from  simultaneously  and 
synchronously  recorded  ultra-high  speed  laryngeal  films  and 
the acoustic speech signal. A second concern of this study
was  to  evaluate  the  feasibility  of  using  the  EGG  in 
improving  speech  analysis  techniques. 

The  data  measured  from  the  ultra-high  speed  laryngeal 
films  include  the  glottal  area  and  the  length  of  the  glottal 
opening. The acoustic speech wave is inverse filtered to
obtain the glottal volume velocity. A comparison of the EGG
with  these  synchronized  data  shows  that  the  EGG  is  a 
function  of  the  lateral  area  of  contact  between  the  vocal 
folds.  A  description  of  the  EGG  during  the  various  glottal 
phases  and  a  qualitative  model  for  the  EGG  are  presented. 
The  experimental  results  show  that  the  EGG  is  an  excellent 
indicator  of  the  period  of  the  glottal  vibration.  An 
algorithm  for  automatically  locating  the  glottal  opening  and 
closing  instants  from  the  EGG  is  described  and 
experimentally  evaluated.  The  spectra  of  the  EGG  and  the 
glottal  volume  velocity  are  computed  and  compared.  This 
comparison  does  not  show  a  consistent  relationship  between 
the  two  spectra. 

A  method  for  voiced/unvoiced  classification  and 
fundamental  frequency  contour  estimation  using  the  EGG  is 
described.  A  comparison  of  the  results  obtained  using  this 
method  with  the  results  from  a  speech  signal  based  method 
indicates  that  the  EGG  based  method  is  simpler,  more 
reliable  and  requires  less  computation. 

The  EGG  is  also  used  to  implement  three  pitch- 
synchronous  linear  prediction  analysis  methods.  Results  are 
presented  to  show  that  the  pitch-synchronous  closed  phase 
analysis  method  provides  an  accurate  estimation  of  the  vocal 
tract  formant  frequencies  and  bandwidths.  This  method  is 
also  used  to  implement  an  automatic,  pitch-synchronous 
technique  to  obtain  the  glottal  volume  velocity  in 
continuous speech.

CHAPTER 1
INTRODUCTION 


Speech  is  the  most  natural  medium  of  communication  for 
human  beings.  The  physiological  speech  mechanism  consists 
of  the  lungs,  vocal  folds,  and  oral  and  nasal  tracts.  The 
improper  functioning  of  any  part  of  this  system  can  result 
in  an  impairment  of  the  ability  to  speak.  The  detection, 
diagnosis  and  treatment  of  speaking  disorders  are  important 
problems  and  require  an  understanding  of  the  speech 
production  process.  There  is  a  need  to  understand  human 
speech  production  from  another  point  of  view  also;  namely  to 
find  efficient  and  reliable  methods  for  the  storage  and 
transmission  of  speech.  Closely  related  to  this  is  the 
problem  of  voice  communication  with  machines. 

A  schematic  representation  of  the  human  vocal  system  is 
shown  in  Figure  1.1.  The  lungs  act  as  a  reservoir  of  air. 
An  increase  in  the  lung  pressure  causes  the  flow  of  air  into 
the  trachea.  For  the  voiced  sounds  in  speech,  this  flow  of 
air  interacts  with  the  vocal  folds  causing  them  to 
vibrate. The aerodynamic-myoelastic theory (1) attempts to
provide  a  mathematical  description  of  vocal  fold 
vibration.  The  vibration  of  the  vocal  folds  "chops"  the  air 
flow  into  discrete  pulses  that  act  as  an  acoustic  sound 
source. This glottal sound source is filtered by the vocal
and nasal tracts and is radiated into the atmosphere as a
pressure wave at the mouth and nostrils. The mathematical
description of the filtering process imposed by the
supraglottal tract is the acoustic theory of speech
production (3).

Figure 1.1 Schematic representation of the human
speech production system (from (2)).

Vocal  fold  vibration  and  its  acoustic  correlate,  the 
glottal  sound  source,  are  as  yet  poorly  understood  areas  of 
speech  research  (4).  There  are  several  problems  where  such 
knowledge  is  of  importance,  e.g.,  the  detection  and 
treatment  of  laryngeal  disorders  (5),  training  aids  for  the 
hearing  impaired  (6),  the  synthesis  of  natural  sounding 
speech (7), and improved modeling of the speech signal.

The  relative  inaccessibility  of  the  larynx  makes  direct 
observation  of  vocal  fold  vibration  in  vivo  impossible.  One 
must,  therefore,  resort  to  various  indirect  observation 
techniques,  e.g.,  indirect  laryngoscopy,  ultra-high  speed 
photography, photoglottography, ultrasound, X-rays,
electroglottography and inverse filtering of the acoustic
speech signal. A review of these methods can be found in
references 5 and 8. Most of these indirect methods have one
or more of the following drawbacks: the procedure is
difficult to apply and so can be used with only a limited
cross section of the population; the apparatus is difficult
to obtain and maintain; or the method can be used with only
a  limited  range  of  phonations. 


Electroglottography, in contrast, appears to be an
inexpensive procedure that can be used with a majority of
the population and over a wide range of phonations.
Electroglottography is basically an electrical impedance
measuring technique (9) and has its origins in the work of
Fabre (10). The principle behind the device is simple
(11):  A  pair  of  electrodes  is  applied  to  the  neck  at  the 
level  of  the  larynx.  A  high  frequency  (about  5  MHz)  current 
passes  from  one  electrode  through  the  neck  and  is  picked  up 
by  the  other  electrode.  As  the  subject  phonates,  the 
opening  and  closing  of  the  vocal  folds  change  the  electrical 
impedance  of  the  neck  in  the  region  of  the  electrodes.  This 
modulates  the  radio  frequency  (RF)  current,  which  is 
demodulated using a detector to yield the electroglottograph
(EGG)  signal.  The  EGG  is  presumably  a  measure  of  the 
changing  electrical  impedance  at  the  neck,  and  hence  of  the 
vocal  fold  vibration.  Figure  1.2  is  a  block  diagram 
illustrating  this  principle. 

While electroglottography was proposed in 1957,
successful  implementation  of  the  method  was  accomplished 
only  recently  (11,12).  The  primary  difficulty  appears  to 
have  been  the  instrumentation  of  the  method;  a  good 
description  of  requirements  of  the  measurement  procedure  and 
the  design  of  a  device  to  meet  the  requirements  is  given  in 
reference  11. 

Figure 1.2 Schematic of the Electroglottographic Technique

In spite of its apparent usefulness as a glottal
sensor, electroglottography is not well understood, and
hypotheses about the EGG signal have not been sufficiently
validated.  The  primary  concern  of  this  study  is  to 
establish  an  understanding  of  the  EGG  and  its  relation  to 
vocal  fold  activity  in  normal  adult  speakers.  The 
methodology  of  the  research  is  a  comparison  of  the  data 
obtained  from  ultra-high  speed  laryngeal  films,  the  EGG  and 
the  acoustic  speech  wave.  A  second  concern  of  this  study  is 
to  evaluate  the  feasibility  of  using  the  EGG  as  a  second 
channel  of  information  for  improving  existing  speech 
analysis  techniques. 

The  organization  of  this  dissertation  is  as  follows: 
The  experimental  data  base  used  in  this  study  is  described 
in  Chapter  2.  We  discuss  there  the  data  collection 
procedure,  the  equipment  used,  the  measurements  made  on  the 
high  speed  films  and  the  synchronization  and  preprocessing 
of  the  signals. 

In  Chapter  3,  we  compare  the  EGG  with  the  data  from  the 
high  speed  films.  Based  on  this  comparison,  the  various 
phases  in  the  EGG  are  identified,  and  a  qualitative  model 
for  the  EGG  is  presented. 

The  linear  source-filter  model  (13)  for  voiced  speech 
is  discussed  in  Chapter  4,  and  the  necessity  for  closed 
phase  analysis  of  the  speech  signal  is  established.  The 
closed  phase  covariance  method  of  glottal  inverse  filtering 
is  then  introduced  as  one  possible  method  for  closed  phase 
analysis.  This  method  is  used  to  obtain  the  glottal  volume 
velocity from the speech data in the experimental data
base. Finally, temporal and spectral comparisons of the
glottal volume velocity, the EGG and the glottal area are
presented.

Chapter  5  deals  with  the  applications  of  the  EGG  in 
speech  analysis.  A  method  for  voiced/unvoiced 
classification  and  pitch  period  estimation  using  the  EGG  is 
described.  Results  obtained  using  this  method  are  compared 
with  the  results  from  a  method  using  only  the  speech 
signal.  The  autocorrelation  and  covariance  methods  of 
linear  prediction  analysis  of  the  speech  signal  are  then 
discussed.  The  use  of  the  EGG  to  segment  the  speech 
waveform  into  individual  pitch  periods,  and  further  into  the 
closed  and  open  phases  in  each  pitch  period  is  described. 
We  introduce  and  compare  three  pitch-synchronous  linear 
prediction  analysis  methods.  Results  from  the 
autocorrelation  and  the  three  pitch-synchronous  linear 
prediction  methods  are  discussed. 

We  summarize  the  important  results  and  conclusions  of 
this  study  in  Chapter  6.  A  number  of  problems  for  future 
research  are  also  identified. 


CHAPTER  2 
EXPERIMENTAL  DATA  BASE:   COLLECTION  AND  MEASUREMENT 

The  primary  goal  of  this  investigation  was  to  establish 
an  understanding  of  the  EGG  by  relating  its  features  to 
vocal  fold  vibration.  To  achieve  this  goal,  we  decided  to 
compare  the  EGG  with  the  vocal  fold  vibrations  and  the 
speech  waveform  obtained  simultaneously  and  synchronously  on 
ultra-high  speed  films  and  magnetic  tape,  respectively.  In 
this  chapter,  we  describe  the  data  collection  and 
measurement  procedures  used  to  obtain  the  experimental  data 
used  in  this  study. 


Subjects  and  Tasks 

Four normal adult males (JMN, DMK, GPM and AKK), who
possessed no evidence of voice disorders or laryngeal
pathology, were the subjects used in the study. The
experimental  tasks  for  each  of  the  subjects  consisted  of 
phonation  of  the  vowel  /i/  at  three  different  intensities  at 
each  of  three  different  fundamental  frequencies.  The  vowel 
/i/  was  chosen  so  that  the  epiglottis  was  held  out  of  the 
optical  pathway  of  the  vocal  fold  image  during  filming; 
however,  because  the  tongue  was  held  down  and  a  laryngeal 
mirror  was  used,  the  procedure  resulted  in  a  sound  closer  to 
an  /a/  in  most  cases.  The  recorded  phonation  was  sustained 
for  about  three  seconds. 


The  three  fundamental  frequencies  used  were  125  Hz,  170 
Hz  and  340  Hz;  to  control  the  fundamental  frequency  during 
the  experiments,  the  subjects  were  asked  to  match  a  pure 
tone  of  the  appropriate  frequency  that  they  heard  over  a 
pair  of  headphones.  The  three  different  intensities  at  each 
fundamental  frequency  represent  a  "comfortable"  intensity, 
an intensity approximately 4 dB above it and another
intensity about 4 dB below it. The actual intensities
produced  were  monitored  using  a  sound  level  meter. 

Thus,  there  were  nine  tasks  for  each  subject,  for  a 
total  of  thirty-six  tasks. 
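
The task count follows directly from the experimental design. As a small illustration, the sketch below enumerates the 4 x 3 x 3 combinations. The names `SUBJECTS`, `F0_TARGETS_HZ`, `INTENSITY_OFFSETS_DB` and `enumerate_tasks` are ours and purely illustrative; the intensity offsets are the approximate values quoted above, relative to each subject's "comfortable" level.

```python
from itertools import product

SUBJECTS = ["JMN", "DMK", "GPM", "AKK"]
F0_TARGETS_HZ = [125, 170, 340]
# Approximate offsets (dB) relative to the subject's "comfortable" intensity.
INTENSITY_OFFSETS_DB = [-4, 0, 4]


def enumerate_tasks():
    """Return one record per (subject, fundamental frequency, intensity) task."""
    return [
        {"subject": s, "f0_hz": f0, "intensity_offset_db": db}
        for s, f0, db in product(SUBJECTS, F0_TARGETS_HZ, INTENSITY_OFFSETS_DB)
    ]


tasks = enumerate_tasks()
tasks_per_subject = len(tasks) // len(SUBJECTS)
```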


Data  Collection  and  Equipment 

High  Speed  Photography 

The technique of ultra-high speed photography of the
vibrating vocal folds is described in (8,14). Briefly, a
laryngeal mirror is held in the subject's mouth at the back
of the pharynx. A high intensity light source is focused
onto the mirror, which reflects the light beam 90° downwards
onto the vocal folds. The image of the vocal folds,
reflected by the same mirror, is focused by a system of
lenses to a high speed camera. As the subject phonates, the
details of the vibration are captured on the film. The film
can then be played back later at a slower speed to view the
detailed vibratory behavior.

The  photographic  equipment  and  configuration  used  for 
this study have been described elsewhere (8). The high
speed camera used was a Fastax model WF-14, which is capable
of  exposure  rates  of  8000  frames/second.  The  camera 
controls  were  adjusted  to  obtain  a  film  speed  of  5000 
frames/second  over  the  last  portion  of  the  film,  when  the 
exposure  rate  is  nearly  constant. 

The  camera  has  two  lens  systems  through  which  images 
can  be  photographed.  The  first  lens  system  was  used  for  the 
photography  of  the  vocal  folds.  A  grid,  adjusted  to  be  in 
the  focal  plane  of  the  vocal  folds,  was  also  photographed 
via  this  lens  system.  This  allows  absolute  measures  of 
vocal  fold  vibratory  patterns  to  be  made,  while  previously 
only  relative  measures  were  possible. 

The  second  lens  system,  specifically  designed  to 
photograph  an  oscilloscope  face,  protrudes  from  the  side  of 
the  camera.  Two  timing  signals  (to  be  described  later)  were 
photographed  via  this  lens.  The  traces  of  these  two  timing 
signals  were  adjusted  to  lie  along  one  edge  of  the  film. 
The  EGG  waveform  was  also  displayed  on  a  third  trace  of  the 
oscilloscope.  This  trace  was  positioned  on  the  other  edge 
of  the  film  away  from  the  two  timing  traces.  Because  the 
two  lens  systems  are  at  a  90°  angle,  the  three  oscilloscope 
traces  appear  on  a  film  frame  that  is  displaced  five  frames 
behind  the  film  frame  recording  the  corresponding  vocal  fold 
image. 

The high speed films used were one of two different
types: the black and white Kodak 7277 4-X reversal film or
the color Kodak Ektachrome 7250 high speed video news film.

EGG and Speech Signals

The  speech  signal  was  obtained  using  a  hearing  aid 
microphone  coupled  directly  to  one  channel  of  a  stereo  tape 
recorder.  The  tape  recorder  used  was  either  a  Revox  A77  or 
a  Teac  A-2060.  The  microphone  was  attached  to  the  handle  of 
the laryngeal mirror at the point where the mirror frame
joins  with  the  handle.  The  distance  of  the  glottis  from 
this  point  varies  from  subject  to  subject,  but  was 
approximately  11  cm  in  most  cases.  The  microphone  was  used 
at  this  particular  location  to  shield  the  audio  signal  from 
the  camera  motor  noise.  The  audio  bandwidth  of  the 
microphone  has  been  measured  to  be  about  6  KHz  with  a  slight 
peak  at  4  KHz. 

The EGG signal was obtained using an electroglottograph
designed  by  D.  Teaney  and  manufactured  by  Synchrovoice 
Associates.  The  EGG  was  recorded  on  one  channel  of  a  Sony 
model  TC530  stereo  tape  recorder.  The  rise  and  fall  times 
of  the  EGG  circuits  were  tested  using  a  square  wave 
calibration  circuit;  this  is  described  in  Appendix  A. 

The  second  channel  of  both  tape  recorders  was  used  to 
record a 10 KHz square wave timing signal. Both tape
recorders  were  run  at  7-1/2  ips  to  obtain  a  flat  frequency 
response  from  50  Hz  to  5  KHz. 


The  Timing  Signals 

A  special  time  code  generator  has  been  designed  to 
allow temporal synchronization of the EGG, speech and film
data (15). The time code generator provides three timing
signals: a 10 KHz square wave that was recorded on the
second channel of both tape recorders, a 5 KHz square wave
derived from the 10 KHz signal, and an 8 bit counter
signal. The latter two timing signals were photographed on
the laryngeal film via the oscilloscope face as described
earlier. The 8 bit counter signal counts the blocks of 100
cycles of the 5 KHz signal that have elapsed since the
initiation of the timing signals; in other words, the 8 bit
counter is at 0 V except after every 100 cycles of the 5 KHz
signal. At these instants, the counter value is incremented
by 1, and the new count asserted on the counter signal
line. Thus, given any frame in the film, if N is the number
of cycles of the 5 KHz square wave between this frame and
the last counter output, and the decimal value of the last
counter output is k, then 100k + N cycles of the 5 KHz square
wave have elapsed between the initiation of the timing and
this frame.

The  10  KHz  signal,  recorded  on  the  second  channel  of 
the  tape  recorders,  was  used  as  the  external  clock  signal 
for  the  Analog  to  Digital  (A-D)  converter  while  digitizing 
the  speech  and  EGG  signals.  Since  all  the  timing  signals 
were  initiated  simultaneously,  and  the  5  KHz  clock  is 
obtained  from  the  10  KHz  signal,  the  number  of  samples 
corresponding  to  100k  +  N  cycles  of  the  5  KHz  square  wave  is 
200k  +  2N.  Thus,  the  (200k  +  2N)th  sample  of  the  EGG  is 
temporally aligned with the given film frame. For the
speech signal, the corresponding sample number for
synchronization is (200k + 2N + d), where d is an additional
factor to account for the propagation delay from the glottis
to the microphone.
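
The frame-to-sample mapping described above can be written out as a short sketch. The function names are ours and purely illustrative. We assume that the 8 bit counter does not wrap around within a recording (256 x 100 cycles of the 5 KHz signal span about five seconds) and that d is expressed in samples of the 10 KHz clock.

```python
def elapsed_5khz_cycles(k, n):
    """Cycles of the 5 KHz square wave elapsed since the timing signals began.

    k: decimal value of the last counter output seen before the frame.
    n: number of 5 KHz cycles between that counter output and the frame.
    """
    if not 0 <= k <= 255:
        raise ValueError("k must fit in an 8 bit counter")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 100 * k + n


def egg_sample_index(k, n):
    """EGG sample aligned with the frame: two 10 KHz samples per 5 KHz cycle."""
    return 2 * elapsed_5khz_cycles(k, n)


def speech_sample_index(k, n, d):
    """Speech sample aligned with the frame, delayed by d samples of propagation."""
    return egg_sample_index(k, n) + d
```

For example, a frame with k = 3 and N = 17 corresponds to 317 cycles of the 5 KHz signal, and hence to EGG sample 634.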


Data  Measurements  and  Preprocessing 

Film  Data  Measurements 

A  number  of  parameters  were  measured  from  the  high 
speed  films  to  characterize  the  vibration.  The  operator 
projected  the  film  onto  a  screen  a  frame  at  a  time  using  an 
Athena  224-ES  stop  frame  projector.  A  segment  of  the  film 
where  the  number  of  5  KHz  square  wave  cycles  between 
successive  counter  outputs  was  close  to  100  was  isolated  at 
this  time.  This  represents  a  segment  where  the  film  speed 
was  close  to  5000  frames  per  second.  One  hundred  and  fifty 
frames  from  this  section  were  chosen  for  analysis. 

We have described elsewhere a semi-automated,
computerized  system  for  the  analysis  of  high  speed  laryngeal 
films  (16).  The  hardware  for  this  system  consists  of  a 
Vidicon  TV  camera  attached  to  a  Spatial  Data  Systems  EyeCom 
108PT  image  processing  terminal.  This  terminal,  interfaced 
to  a  Data  General  NOVA  4  minicomputer,  has  the  capability  of 
displaying  video  images  along  with  superposed  graphics.  The 
display  screen  is  divided  into  640  x  480  coordinate 
locations.  A  cursor,  controlled  by  the  operator  with  a 
joystick,  can  be  moved  to  any  desired  location  on  the 
screen, and the cursor coordinates transferred to the
computer. The terminal is also capable of digitizing images
with a spatial resolution of 640 x 480 pixels and an intensity
resolution of 256 gray levels.

The  operation  of  the  film  measurement  system  can  be 
briefly  described  as  follows: 

The  glottal  images  are  projected  using  the  Athena  stop 
frame  projector  onto  a  45°  mirror,  which  reflects  the  image 
upwards  onto  a  translucent  screen.  The  image  formed  on  the 
screen  is  scanned  by  the  TV  camera  and  displayed  on  the 
EyeCom  display.  The  operator,  using  the  joystick  cursor, 
measures  the  length  of  the  glottis  and  the  width  at  five 
chosen  locations.  The  glottal  boundary  can  then  be 
approximated  using  a  number  of  straight  lines.  The  computer 
program  calculates  the  glottal  area  using  this  straight  line 
approximation.   This  procedure  is  illustrated  in  Figure  2.1. 
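
A minimal sketch of such a straight-line area computation is given below. It is not the program of (16); it rests on assumptions of ours: the widths are measured perpendicular to the length axis, the glottal width is zero at the anterior and posterior ends, and, unless positions are supplied, the measurement locations are equally spaced along the length. The function name `glottal_area` and its parameters are hypothetical.

```python
def glottal_area(length, widths, positions=None):
    """Area enclosed by a piecewise-linear glottal outline.

    length:    glottal length (any consistent unit).
    widths:    glottal widths at the measurement locations.
    positions: distances of those locations from the anterior end;
               assumed equally spaced along the length if omitted.
    """
    count = len(widths)
    if positions is None:
        positions = [length * (i + 1) / (count + 1) for i in range(count)]
    if len(positions) != count:
        raise ValueError("one position is required per width")
    # Assumption: the outline closes to zero width at both ends.
    xs = [0.0] + list(positions) + [float(length)]
    ws = [0.0] + list(widths) + [0.0]
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("positions must increase strictly within the length")
    # Sum the trapezoids between successive measurement locations.
    area = 0.0
    for i in range(len(xs) - 1):
        area += 0.5 * (ws[i] + ws[i + 1]) * (xs[i + 1] - xs[i])
    return area
```

With the grid photographed in the focal plane of the vocal folds, the pixel measurements would first be scaled to absolute units before a computation of this kind is applied.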

The  EGG  trace,  photographed  on  the  film,  can  also  be 
digitized  using  the  system.  The  EGG  signal  obtained  by  this 
method  will  henceforth  be  referred  to  as  the  traced  EGG  in 
this  study. 

Figure 2.1 Measurement of a laryngeal film frame. The glottal
contour is outlined using straight lines to measure the glottal
area. W1-W5 are the locations where the widths are measured;
A - Anterior; P - Posterior; VP - Vocal Process.

The difficulty of outlining the glottal image
consistently introduces noise in the measured values of the
glottal area, length and widths. The traced EGG is also
noisy because of the limited spatial resolution of the
EyeCom terminal. Consequently, these measures have to be
suitably smoothed. Since the smoothing technique needs to
preserve the abrupt changes in the signals between glottal
phases while eliminating sharp, point-like jumps, purely
linear smoothing methods are unsuitable. Instead, a
combination of nonlinear median smoothing and linear
smoothing as described in (17) was used.
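
To illustrate why a median stage helps, the sketch below applies a running median followed by a short linear smoother. The window lengths (a median of 5 and a 3 point Hanning window) and the edge handling are assumptions of ours; the exact configuration of (17) is not reproduced here.

```python
from statistics import median


def running_median(x, width=5):
    """Running median; the window is truncated at the ends of the signal."""
    half = width // 2
    n = len(x)
    return [median(x[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def hanning3(x):
    """3 point linear smoother with weights 1/4, 1/2, 1/4; end points are kept."""
    n = len(x)
    if n < 3:
        return list(x)
    out = [x[0]]
    for i in range(1, n - 1):
        out.append(0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[i + 1])
    out.append(x[-1])
    return out


def smooth(x, width=5):
    """Median smoothing followed by linear smoothing."""
    return hanning3(running_median(x, width))
```

Applied to a sequence containing an isolated spike followed by a step, the median stage removes the spike while leaving the position of the step unchanged; the linear stage then rounds the step only slightly.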

Digitization and Preprocessing of the EGG and Speech

As  explained  earlier,  the  10  KHz  timing  signal 
recorded  on  the  second  channel  of  the  tape  recorders  was 
used  as  the  clock  source  for  the  A-D  converter  in  digitizing 
the  speech  and  EGG  signals.  Due  to  the  limited  bandwidth  of 
the  tape  recorders,  the  timing  signal  was  passed  through  a 
waveshaping  circuit  to  obtain  "clean"  square  waves.  Small 
variations  in  the  tape  speed  and  jitter  in  the  waveshaping 
circuit are, however, sufficient to introduce errors in the
synchronization  amongst  the  various  signals. 

After digitization, the EGG and speech signals are
subjected to two stages of preprocessing: correction for tape
recorder distortion, and highpass filtering to remove noise
and power line components.

Tape  Distortion.  The  capacitor  coupling  used  in  normal 
audio  tape  recorders  introduces  phase  and  magnitude 
distortion,  primarily  in  the  low  frequency  region  below  200 
Hz. This distortion can significantly affect the results
obtained  from  the  inverse  filtering  of  the  speech  waveforms 
(18).  In  the  EGG,  the  distortion  is  manifested  as  a 
downward  slope  in  the  EGG  during  the  glottal  open  phase. 
Berouti  (19)  has  described  a  method  for  correcting  such 
distortion.  The  method,  described  in  Appendix  B,  involves 
the derivation of the tape recorder transfer function using
a reference signal. The traced EGG was used as the
reference in correcting the recorded EGG for the
distortion. This enabled the correction parameters to be
obtained for each task. A similar reference signal is,
however, not available for the speech signal. Consequently,
a fixed correction had to be derived and applied to all the
speech waveforms. This is also discussed in Appendix B.

Highpass  Filtering.  The  speech  and  EGG  waveforms  were 
band  pass  filtered  using  a  351  point,  linear  phase  FIR 
filter  (20).  The  transfer  function  of  the  filter  is  shown 
in  Appendix  C. 
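
As a rough illustration of a 351 point linear phase FIR filter, the sketch below designs a highpass response by the windowed-sinc method and applies it with the 175 sample group delay removed, so that the filtered signals stay aligned with the film data. This is not the design of (20) or Appendix C: the Hamming window, the 80 Hz cutoff and the function names are assumptions of ours, and only the filter length and the 10 KHz sampling rate come from the text.

```python
import math


def design_highpass(num_taps=351, cutoff_hz=80.0, fs_hz=10000.0):
    """Symmetric (linear phase) highpass taps by spectral inversion of a lowpass."""
    if num_taps % 2 == 0:
        raise ValueError("num_taps must be odd")
    mid = (num_taps - 1) // 2
    fc = cutoff_hz / fs_hz  # cutoff as a fraction of the sampling rate
    lowpass = []
    for n in range(num_taps):
        m = n - mid
        ideal = 2 * fc if m == 0 else math.sin(2 * math.pi * fc * m) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        lowpass.append(ideal * window)
    gain = sum(lowpass)
    lowpass = [v / gain for v in lowpass]  # unit gain at 0 Hz
    highpass = [-v for v in lowpass]
    highpass[mid] += 1.0  # delta minus lowpass: zero gain at 0 Hz
    return highpass


def filter_zero_delay(x, taps):
    """Convolve x with taps and discard the (len(taps) - 1) / 2 sample delay."""
    mid = (len(taps) - 1) // 2
    n = len(x)
    y = []
    for i in range(n):
        acc = 0.0
        for j, t in enumerate(taps):
            k = i + mid - j
            if 0 <= k < n:  # samples outside the record are treated as zero
                acc += t * x[k]
        y.append(acc)
    return y
```

Because the taps are symmetric, every frequency component is delayed by the same 175 samples, which is why a single fixed shift suffices to restore the alignment.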

Data  Synchronization 

We  have  already  explained  the  procedure  for 
theoretically  synchronizing  the  different  glottal  waveforms 
using  the  timing  signal.  In  practice,  we  found  that  small 
synchronization  errors  existed  after  following  the  alignment 
procedure.  These  errors  were  primarily  due  to  the  sampling 
errors  during  digitization,  as  explained  above.  However, 
the  traced  EGG  is  obtained  from  the  films,  and  is 
consequently  in  perfect  alignment  with  the  film  data.  The 
approach  adopted  to  solve  the  synchronization  problem  was 
therefore  the  following: 


1) the EGG obtained from the tape was shifted
sufficiently to align it with the traced EGG. This
typically involved shifts of less than 10 samples; and

2) the assumption was made that the speech and EGG
obtained from the tapes are in synchronization; therefore,
the speech signal was also shifted by the same amount as the
shift required to align the traced and recorded EGG
signals. The speech was further shifted by four samples to
compensate for the acoustic propagation delay from the
glottis to the microphone.
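
The two steps above can be sketched as follows. How the shift was actually determined is not stated here; the sketch assumes a search for the integer shift that maximizes the correlation between the traced and recorded EGG, and it assumes that both have been brought to the same sampling rate beforehand. All function names are hypothetical.

```python
def best_shift(reference, signal, max_shift=10):
    """Shift s (in samples) for which signal[i + s] best matches reference[i]."""
    ref_mean = sum(reference) / len(reference)
    sig_mean = sum(signal) / len(signal)
    ref = [v - ref_mean for v in reference]
    sig = [v - sig_mean for v in signal]
    best, best_score = 0, float("-inf")
    for s in range(-max_shift, max_shift + 1):
        total, count = 0.0, 0
        for i in range(len(ref)):
            j = i + s
            if 0 <= j < len(sig):
                total += ref[i] * sig[j]
                count += 1
        if count and total / count > best_score:
            best, best_score = s, total / count
    return best


def shift_signal(signal, s, fill=0.0):
    """Return out with out[i] = signal[i + s]; missing samples are set to fill."""
    n = len(signal)
    return [signal[i + s] if 0 <= i + s < n else fill for i in range(n)]


def align_recordings(traced_egg, recorded_egg, speech, propagation_delay=4):
    """Align the recorded EGG to the traced EGG, then shift the speech to match."""
    s = best_shift(traced_egg, recorded_egg)
    egg_aligned = shift_signal(recorded_egg, s)
    # The speech lags the EGG by the acoustic propagation delay.
    speech_aligned = shift_signal(speech, s + propagation_delay)
    return s, egg_aligned, speech_aligned
```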

Potential  Errors  in  the  Data  Sets 


Film  Data 

There  are  two  primary  sources  of  error  in  the  data 
measured  from  the  films: 

1)  the  entire  vocal  folds  from  the  anterior  to  the 
posterior  may  not  be  exposed  in  the  film.  Typically,  this 
is  due  to  the  shadowing  of  the  anterior  portions  of  the 
vocal  folds  by  the  epiglottis.  The  occluded  portion  of  the 
glottis  is  left  out  of  the  measurement  because  of  the 
difficulty  of  extrapolating  the  glottal  contour  over  this 
portion.  This  is  a  possible  source  of  systematic  error  in 
the  film  data;  and 

2)  the  second  source  of  error  is  the  inaccuracies  in 
the  measurement  process  itself;  such  errors  are  discussed  in 
(21). 


Digitized Data

The  errors  in  the  digitized  data  also  arise  from  two 
sources:

1) the synchronization errors due to the sampling
process; and

2)  the  errors  due  to  the  tape  recorder  distortion. 

The  traced  EGG  can  be  used  to  correct  the  recorded  EGG 
and  reduce  these  errors  significantly.  The  use  of  a  4 
channel  FM  recorder  in  future  work  should  eliminate  both  of 
these  problems. 


CHAPTER 3
A STUDY OF THE SYNCHRONIZED ULTRA-HIGH SPEED FILMS
AND THE ELECTROGLOTTOGRAPH SIGNAL


Introduction


The importance of the electroglottograph signal (EGG)
as  a  method  of  assessing  vocal  fold  vibration  was  pointed 
out  in  Chapter  1.  The  interpretation  of  the  various  phases 
and  features  of  the  EGG  was  identified  as  a  research  goal 
there.  The  experimental  data  base  on  which  this  study  is 
based  and  the  various  measures  of  vocal  behavior  obtained 
from  this  data  base  were  developed  in  Chapter  2.  In  this 
chapter,  the  glottal  area  and  the  length  of  the  glottal 
opening  measured  from  the  ultra-high  speed  films,  and  visual 
observations  of  these  films  are  used  to  analyze  and 
interpret  the  synchronized  EGG.  The  plan  of  this  chapter  is 
as  follows: 

Since  some  understanding  of  the  structure  of  the  normal 
vocal  folds  and  the  vibration  of  the  vocal  folds  in  normal 
voice  is  essential  to  studying  the  EGG,  these  are  discussed 
in  the  next  two  sections. 

The  current  interpretations  of  the  EGG  and  evidence  on 
which  these  are  based  are  outlined  in  the  third  section. 

The  next  three  sections  are  concerned  with  correlating 
the  EGG  with  the  glottal  area,  the  length  of  glottal  contact 
and visual observations from the films, respectively.

Finally, in the last section a qualitative model for
the  EGG  is  presented,  based  on  the  results  of  the  previous 
three  sections. 

Structure  of  the  Vocal  Folds 

The  human  vocal  folds  contained  within  the  larynx  are 
the  basic  vibrators  that  provide  the  source  for  the  voiced 
sounds   of   speech.     The   morphological   and   histological 
structure  of  the  vocal  folds  is  therefore  of  considerable 
importance  in  speech  science  and  has  been  the  subject  of 
much research (22,23). It is only the free surface of the
vocal folds that takes part in the vibration, and this is
typically  described  as  consisting  of  two  layers,  a  body  and 
a  cover  (22).   The  body  is  the  vocalis  muscle  and  the  cover 
is  the  mucosal   layer  covering  this  muscle.    A  schematic 
representation  of  this  layer  structure  is  shown  in  Figure 
3.1.   This  separation  of  the  vocal  folds  into  two  layers  is 
considered  essential   in  sustaining  vocal   fold  vibrations 
(1). 

The  vocal  folds  act  as  a  mechanical  vibrator,  and  so 
adequate  lubrication  of  this  mechanism  is  necessary  for 
their  proper  and  sustained  functioning  (24).  This 
lubrication  is  provided  by  the  mucus  squirted  on  to  the 
cords  by  the  ventricular  glands.  Therefore,  as  pointed  out 
by  Fourcin  (25),  the  mucus  can  be  considered  a  third  layer 
or part of the vocal folds. While the mucus is typically
left out in most discussions of vocal fold vibration, it can
influence the EGG considerably, as will be evident in the
sequel.

Figure 3.1 Structure of the human vocal folds (from
(23), with permission of the author).


Vibration  of  the  Vocal  Folds 


Observations  of  the  vibrations  of  the  excised  vocal 
folds  using  stroboscopy  (26)  and  of  the  normal  folds  during 
phonation  using  high  speed  films  (8)  reveal  that  the  vocal 
folds  undergo  complex  three-dimensional  movements.  Phase 
differences  during  a  vibratory  cycle  exist  among  the 
different  portions  of  the  vocal  folds,  both  along  their 
thickness  and  their  length.  These  complicated  wave-like 
behaviors  are  being  understood  and  modeled  only  of  late 
(27,28). 

It  is  now  generally  accepted  that  during  normal  chest 
voice  phonation  the  more  inferior  body  of  the  vocal  folds 
vibrates  out  of  phase  with  respect  to  the  more  superior 
cover.  In  fact,  according  to  the  current  flow  separation 
theory  of  vocal  fold  vibration,  it  is  this  phase  difference 
that  transfers  energy  from  the  air  flowing  through  the 
glottis  to  the  vibrating  system  (1). 

Most  descriptions  of  vocal  fold  vibration  divide  a 
single vibratory cycle into at least 3 distinct phases: i)
an opening phase during which the vocal folds pull apart
increasing the area of the glottal opening, ii) a closing
phase during which the vocal folds come together reducing
the glottal area, and iii) a closed phase during which the
vocal folds are maximally closed. Note that in some
vibratory modes, as in a breathy voice, a distinct closed
phase may not exist and the area of the glottal opening
shows an almost sinusoidal variation with time.

Based  on  observations  using  excised  larynges  (26,29), 
ultra-high  speed  photography  (8,14,30),  ultrasonography  (31) 
and  X-ray  stroboscopy  (32),  the  movements  of  the  vocal  folds 
during  these  three  phases  in  normal  chest  voice  can  be 
described  as  follows: 

During the opening phase, the vocal folds first
separate inferiorly and the opening moves upwards with a
wave-like motion in the mucous membrane. Occasionally, the
opening first appears on the superior surface as a small
"chink" which then opens up in a "zipper"-like fashion.

The  closing  phase  begins  with  contact  between  the  lower 
edges  of  the  glottis.  The  closure  then  proceeds  along  the 
length  of  the  lower  edge  and  is  then  followed  by  the  mucosal 
layers  coming  together. 

The  closed  phase  is  not  necessarily  associated  with  an 
increasing  amount  of  contact  between  the  vocal  folds.  It  is 
often  observed  (26)  that  as  the  vocal  folds  come  into 
contact  in  a  vertical  plane,  they  may  be  pulling  apart  at 
the  same  time  in  a  different  vertical  plane. 

A schematic representation of vocal fold vibration
observed in an excised canine larynx (26) is shown in Figure
3.2. This figure serves to elucidate the verbal description
given above.

Figure 3.2 Schematic representation of vocal fold
vibration in chest voice phonation (from
(26) with permission of the author).


Interpretations of the EGG


Almost all the current interpretations of the EGG are
based on correlating the EGG waveform with one or more
simultaneously and synchronously obtained glottographic
signals. Since no one glottographic signal provides
complete information about the vibration, the observed
behavior is then extrapolated based on the knowledge of the
expected behavior of the vibrating vocal folds. The present
study uses the glottal volume velocity obtained by inverse
filtering the acoustic speech wave and the high speed films
of the vocal folds as the corroborative glottographic
signals. Different glottographic signals provide evidence
of different aspects of vocal fold vibration, so it is
useful to review the studies that have compared the EGG
with other glottographic signals.

Fant et al. (33) correlated the EGG with optical
glottography and inverse filtering of the speech wave and
concluded that i) the flat portion of the EGG corresponds to
the glottal open phase, ii) the rapid fall in the EGG
corresponds to the closing portion, and iii) the ascending
portion of the EGG corresponds to the opening of the vocal
folds. A "slope break" in the opening phase of the EGG was
sometimes seen, and this corresponds to the opening instant.

Fourcin  (34)  studied  the  EGG  combined  with  stroboscopic 
photography  and  concluded  that  there  is  an  antiphase 
relationship  between  the  EGG  and  the  glottal  area  of 
opening. He also states

. . . the electrical output is only really significant
during the period of vocal fold closure . . . (34, page
318).

Fog-Pedersen  (35)  combined  EGG  with  stroboscopic 
observation  and  based  on  this  arrived  at  the  representation 
of  the  EGG  during  a  single  cycle  as  shown  in  Figure  3.3. 

Lecluse (36) also combined electroglottography with
simultaneous stroboscopic observations and postulated the
model for the EGG shown in Figure 3.4. He also measured
numerous quotients from the EGG and identified two basic
forms of the EGG:

a broad electroglottogram, which occurred mainly in the
low-frequency range (below 150 Hz), and a narrow, nearly
symmetrical electroglottogram, which occurred principally
in the frequency range above 150 Hz. (36, page 162)

Fourcin  (25)  and  Rothenberg  (37)  correlated  the  EGG 
with  the  glottal  volume  velocity  derived  by  inverse 
filtering  the  acoustic  speech  signal.  Rothenberg  has  used 
the  idealized  model  for  the  EGG  shown  in  Figure  3.5  to 
describe the features in the EGG. He notes that the start
of the glottal open phase can usually be associated with a
discontinuity in the slope of the EGG. This is in keeping
with the observations of Fant et al. (33) and Fourcin (25).

Smith  (15)  and  Childers,  Smith  and  Moore  (4)  combined 
the  EGG  with  observations  of  the  ultra-high  speed  films 
taken  simultaneously.  They  measured  the  length  of  contact 
of  the  vocal  folds  along  the  midsagittal  plane  and  found  a 
high  degree  of  correlation  between  this  length  and  the 
EGG. Their observations support the model of Figure 3.5.

1 Maximum opening phase
2 Maximum closing phase
Points 3 and 4 are changes from the plateau to
the glottal slope of the glottographic curves.

Figure 3.3 Fog-Pedersen's model for the EGG (after 35).

1 is the moment of initial closure at a single point
2 is the moment at which closure is completed over the whole
  length, but not in the vertical plane
3 is the moment at which closure is completed over the whole
  vertical plane
4 is the moment at which opening begins
5 is the moment at which the whole length is open

Figure 3.4 Lecluse's model for the EGG (after 36).

1-2 vocal folds maximally closed
3-4 folds separating from lower margins
    towards upper margins
3-5 upper fold margins separating
7   lower margins close
3-7 folds apart
1   closure reaches upper fold margins

Figure 3.5 Rothenberg's model for the EGG (after 37).

Baer, Titze and Yoshioka (38) studied synchronized EGG,
photoglottography and the glottal volume velocity. Their
experiments support the conclusions of Rothenberg and
Fourcin.

Recently,  Baer,  Lofquist  and  McGarr  (39)  compared  the 
information  obtained  from  synchronized  high  speed  films, 
photoglottography and EGG. Results are presented for one
task  from  a  male  and  one  task  from  a  female  subject.  They 
found  that  the  minimum  in  the  EGG,  corresponding  to  maximum 
glottal  contact,  seems  to  occur  at  the  instant  of  glottal 
closure.  The  instant  of  glottal  opening  coincided  with  a 
slope  discontinuity  in  the  EGG  in  the  example  from  the  male 
subject.  Glottal  opening  for  the  female  subject  was  gradual 
with  large  horizontal  phase  differences  along  the  length  of 
the  folds. 

Glottal  Area  and  the  EGG 


The  projected  glottal  area  as  measured  from  the  high 
speed  films  has  been  used  as  a  "measure"  of  vocal  fold 
vibration.  Several  parameters  can  be  defined  to 
characterize  the  vibration  (8,30).  The  glottal  area  has 
also  been  used  to  determine  the  instants  of  glottal  closure 
and  opening.  Thus  the  first  step  in  studying  the  EGG  is  to 
establish  correspondence  between  the  glottal  area  and  the 
EGG. 

Figures  3.6-3.9  are  plots  of  synchronized  EGG, 
differentiated EGG and glottal area for a number of typical
tasks from our data base. The arrangement of these plots is
as follows:

The  first  graph  in  each  plot  is  the  EGG  (EGG).  Next  is 
the  differentiated  EGG  (D-EGG)  and  the  last  is  of  the 
glottal  area  (AREA).  Two  sets  of  dashed  vertical  lines  have 
been  drawn  in  these  graphs.  The  first  set  is  drawn  at  the 
glottal  opening  and  closing  instants  in  each  pitch  period  of 
the  glottal  area.  Also  included  in  this  set  are  vertical 
lines  at  the  maximum  value  of  glottal  area  in  each  period. 
The  second  set  of  vertical  lines  is  drawn  at  the  maximum  and 
minimum  in  the  differentiated  EGG  in  each  pitch  period.  The 
significance  of  this  second  set  of  lines  will  be  obvious 
shortly. 

Based  on  the  study  of  such  plots  for  all  the  tasks  in 
the  data  base,  we  describe  the  EGG  during  the  different 
glottal phases in the next subsection.

Description of the EGG

Closed  Phase.  The  start  of  the  closed  phase  is  usually 
associated  with  a  rapid  decrease  in  the  EGG.  The  minimum  in 
the  EGG,  corresponding  to  maximum  lateral  contact,  occurs  in 
the  closed  phase  after  glottal  closure.  The  behavior 
reported by Baer et al. (39) in which the minimum in the EGG
occurred  at  the  instant  of  glottal  closure  was  not  observed 
in  any  of  the  data  sets.  The  EGG  begins  to  increase  from 
its  minimum  while  still  in  the  closed  phase,  reflecting  the 
separation  of  the  folds  from  the  inferior  surfaces  towards 
the upper margins.

The observed shape of the EGG in this period is
typically parabolic, implying that the depth of contact of
the folds is continuously changing. Most of the examples in
Figures 3.6-3.9 show this behavior. Occasionally, the EGG
has an "almost" flat region during which the depth of
contact is presumably constant. An example is shown in
Figure 3.8(a).

Opening  Phase.  The  opening  phase  is  defined  in  terms 
of  the  glottal  area  as  the  duration  from  glottal  opening  to 
the  maximum  value  of  the  glottal  area.  The  glottal  area 
during this phase increases monotonically to its maximum.
Observation of the corresponding high speed film frames
reveals that the initial glottal opening is gradual with
large  horizontal  phase  differences.  Thus  it  may  take 
several  film  frames  for  the  glottal  opening  to  spread  to  the 
entire  length  of  the  folds.  Further  increase  in  the  glottal 
area  is  brought  about  by  the  folds  moving  apart  with  no 
change  in  the  lateral  contact  between  them.  The  EGG 
consequently  shows  two  distinct  phases.  In  the  first,  the 
EGG increases monotonically, reflecting the decreasing
lateral  contact  between  the  folds.  Once  the  folds  have 
separated,  the  EGG  remains  constant  while  the  folds  pull 
apart  further. 

This  description  of  the  EGG  during  the  opening  phase  is 
consistent with the observations of Baer et al. (39).

Closing  Phase.    The  closing  phase  is  defined  as  the 
duration from the maximum glottal area to the instant of
glottal closure. The area decreases monotonically to zero
during  this  time  and  is  usually  symmetric  with  respect  to 
the  opening  phase.  The  movements  of  the  vocal  folds, 
however,  reveal  a  basic  asymmetry  between  the  opening  and 
the  closing  phases.  Over  a  large  portion  of  the  closing 
phase,  the  vocal  folds  adduct  towards  their  medial  position 
with  little  or  no  change  in  the  length  of  contact  along  the 
midsagittal  line.  Just  prior  to  closure,  the  vocal  folds 
are  almost  parallel  with  a  narrow  opening  along  their  entire 
length.  Closure  occurs  almost  simultaneously  along  the 
entire  midsagittal  line.  Thus  while  the  glottal  area  does 
not reflect this fact, the glottal closure is an abrupt
phenomenon.

The  EGG,  as  a  result,  again  has  two  distinct  phases. 
In  the  first,  the  EGG  continues  to  maintain  a  constant  value 
while  the  vocal  folds  come  together  without  contact.  Then 
comes  the  characteristic  rapid  fall  in  the  EGG  corresponding 
to  the  almost  simultaneous  contact  along  the  length  of  the 
folds.  This  is  perhaps  the  most  consistently  observed 
feature  in  the  EGG  during  normal,  chest  voice  phonation. 
The experiments of Baer et al. (39), Rothenberg (37), Fourcin
(34)  and  Lecluse  (36)  agree  on  this  point. 

Rothenberg  (37)  has  suggested  that  the  EGG  may  be 
influenced  by  the  electrical  capacitance  of  the  glottal 
opening,  particularly  when  the  folds  are  close  but  not 
touching.  As  explained  above,  this  is  typically  the  case 
just prior to glottal closure. The EGG, however, does not
show any changes during this time, implying that the
capacitive effects are not significant.

Determination of the Opening and Closing Instants from the
EGG

The  description  of  the  EGG  in  the  previous  section  made 
no  reference  to  features  in  the  EGG  that  mark  the  instants 
of  glottal  opening  and  closure.  One  of  the  important 
applications  of  the  EGG  is  in  automating  analysis  of  vocal 
fold  behavior.  It  is  therefore  necessary  to  define  suitable 
operational  techniques  to  locate  these  instants  from  the 
EGG.  The  usefulness  of  the  definition  must  be  validated  by 
comparing  the  results  obtained  by  using  the  EGG  against 
other  standard  techniques. 

While  previous  studies  have  located  features  in  the  EGG 
that  correspond  to  glottal  opening  and  closure,  they  do  not 
enable  a  unique  determination  of  these  times.  For  example, 
glottal  closure  is  said  to  occur  during  the  "rapid  fall"  in 
the  EGG.  Since  this  fall  can  span  several  film  frames,  just 
which  of  these  corresponds  to  glottal  closure?  The  case  of 
the  glottal  opening  is  worse  since  the  corresponding 
feature,  a  slope  discontinuity  in  the  EGG,  need  not  even  be 
present  in  the  waveform. 

The  discussion  so  far  points  to  the  rate  of  change  of 
the  EGG  with  time  as  a  better  candidate  for  locating  the 
glottal  opening  and  closing  instants  rather  than  the  EGG 
itself.  The  time  differentiation  for  sampled-time  data,  as 
used  in  this  study,  can  be  easily  approximated  by  the 
discrete  time  filter, 

H(z) = 1 - z^{-1}.

The differentiated EGG (Diff EGG) is included in the
synchronized data plots of Figures 3.6-3.9. Thirty out of
the thirty-six data sets used in this study show similar
waveforms.  The  six  data  sets  that  do  not  fit  in  this 
category  appear  to  be  examples  of  breathy  phonation  in  which 
the  vocal  folds  vibrate  without  any  significant  contact 
between  them. 
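
As an illustration, the first-difference filter H(z) = 1 - z^{-1} used above can be sketched in Python. This is a minimal sketch and not the program used in this study; the function name first_difference and the choice of setting the first output sample to zero (to avoid a start-up transient) are assumptions.

```python
def first_difference(x):
    """Approximate time differentiation with y[n] = x[n] - x[n-1].

    Assumption: the first output sample is set to 0 because x[-1] is unknown.
    """
    if not x:
        return []
    prev = x[0]
    y = []
    for sample in x:
        y.append(sample - prev)  # difference with the previous sample
        prev = sample
    return y
```

Applied to an EGG record, the output plays the role of the Diff EGG trace discussed in this section.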

Closing  Instant.  As  was  explained  earlier,  closure 
occurs  almost  simultaneously  along  the  length  of  the  folds 
and  the  EGG  decreases  rapidly  during  this  time.  This  rapid 
fall  is  typically  less  than  0.6  ms  in  duration.  The  Diff 
EGG  has  a  sharp  negative  spike  that  corresponds  to  the 
greatest  rate  of  decrease  of  the  EGG.  The  instant  of 
glottal  closure  is  operationally  defined  for  this  study  as 
the  minimum  in  the  Diff  EGG  during  a  voice  period.  The 
rapidity  of  glottal  closure  ensures  that  this  feature  is 
usually  within  0.6  ms  of  the  actual  closure  instant. 

Opening  Instant.  Our  earlier  discussions  have  pointed 
out  that  the  EGG  changes  slowly  during  the  glottal  opening 
phase  because  of  the  horizontal  phase  differences  associated 
with  the  opening.  Examination  of  the  synchronized  data 
plots  in  Figures  3.6-3.9  shows  that  the  Diff  EGG  is  maximum 
very  close  to  the  instant  of  glottal  opening.  This  instant 
typically  corresponds  to  a  point  of  inflection  in  the  EGG, 
where  it  changes  from  a  concave  upwards  to  a  concave 
downwards  curve.  When  a  slope  discontinuity  is  present  in 
the EGG, our observation has been that the point of slope
discontinuity is also such an inflection point. Now the
second derivative of a function is either zero or does not
exist at a point of inflection (40). A second observation
is that the second derivative of the EGG does not exist at
this point of inflection in the EGG. We illustrate these
remarks with the sketches shown in Figure 3.10.

The  close  correspondence  between  the  maximum  in  the 
Diff  EGG  (which  also  happens  to  be  the  point  of  inflection) 
and  the  glottal  opening  is  observed  very  consistently  in  all 
the  data  sets  we  have  studied.  The  EGG  and  Diff  EGG 
waveforms  in  (41)  also  fit  this  model.  Other  researchers 
have  not  used  the  Diff  EGG,  but  our  description  of  the  EGG 
during  glottal  opening  appears  applicable  to  the  waveforms 
published  in  (37)  also. 

Based  on  this  discussion,  we  define  the  opening  instant 
as  the  maximum  in  the  Diff  EGG  during  a  glottal  period. 

Period  and  Open  Quotient.  Once  the  opening  and  closing 
instants  have  been  determined,  the  pitch  period,  T,  is 
defined as the time duration between two successive closing
instants.

The open quotient (O.Q.) is defined as

O.Q. = (duration of the open phase) / (pitch period).
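
A minimal Python sketch of these two definitions follows. It assumes, beyond what is stated above, that the open phase of a cycle runs from the opening instant to the next closing instant, that instants are given as sample indices, and that the sampling rate fs is supplied by the caller; the function name is hypothetical.

```python
def period_and_open_quotient(close_idx, open_idx, fs):
    """Return a list of (period in seconds, open quotient), one per glottal cycle.

    A cycle is the span between two successive closing instants. Cycles that
    do not contain exactly one opening instant are skipped (an assumption).
    """
    results = []
    for c0, c1 in zip(close_idx, close_idx[1:]):
        opens = [o for o in open_idx if c0 < o < c1]
        if len(opens) != 1:
            continue
        period = (c1 - c0) / fs                 # pitch period T
        oq = (c1 - opens[0]) / (c1 - c0)        # open phase / pitch period
        results.append((period, oq))
    return results
```

For example, closing instants at samples 10, 90 and 170 with opening instants at samples 50 and 120 give open quotients of 0.5 and 0.625 for the two cycles.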

Figure 3.10 Illustration of EGG during glottal opening

Note that the opening and closing instants have been
defined as the maximum and the minimum in the Diff EGG in a
single glottal period. Given an EGG record containing
several glottal periods, the value of the maximum and the
minimum of the Diff EGG need not be the same in the
different periods. A method for automatically locating the
EGG opening and closing instants for all the periods in the
record has been implemented and can be described by the
algorithm given below.

Algorithm EGG-Closed-Open

Let the EGG record be EGG(1) ... EGG(NUMB).

1. Remove the mean from the EGG record.

2. Differentiate the EGG using the filter

H(z) = 1 - z^{-1}.

3. Locate the positive-to-negative zero crossing
instants and negative-to-positive zero crossing instants
in the EGG. Label these PN(i) and NP(j) respectively.
Note that the simple form of the EGG ensures that either

NP(1) < PN(1) < NP(2) < PN(2) < ... < NP(K) < PN(K)  OR
PN(1) < NP(1) < ... < PN(K) < NP(K) < ... etc.

depending on whether the record starts with a
positive-to-negative or negative-to-positive zero
crossing. Let n = number of positive-to-negative zero
crossings and m = number of negative-to-positive zero
crossings. Note that |n - m| < 2.

4. Initialization:
If NP(1) < PN(1)
   Then locate a maximum in the Diff EGG between Diff
   EGG(1) and Diff EGG(PN(1)). Label this instant
   OPEN(1); it is the first opening instant.
   Else, locate a minimum in the Diff EGG between Diff
   EGG(1) and Diff EGG(NP(1)). Label this instant
   CLOSE(1); it is the first closure instant.

5. The loop:

For i = 1, ..., n - 1,
   locate a closure instant CLOSE(i) as the location
   of the minimum in Diff EGG between Diff
   EGG(OPEN(i)) and Diff EGG(NP(i)).
For i = 1, ..., n - 1,
   locate an opening instant OPEN(i) as the location
   of the maximum in Diff EGG between Diff
   EGG(CLOSE(i)) and Diff EGG(PN(i)).

6. End.

Note  that  this  algorithm  locates  the  opening  and 
closing  instants  sequentially  and  uses  only  zero  crossing 
and  peak  picking  information.  It  is  therefore  capable  of 
real  time  implementation. 
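
The steps of the EGG-based algorithm can be sketched in Python as follows. This is an illustrative reading, not the program used in this study: the index bookkeeping of step 5 is interpreted as alternating searches (a minimum of the Diff EGG from the last opening instant to the next negative-to-positive crossing, then a maximum from the last closing instant to the next positive-to-negative crossing), the first output sample of the differentiator is set to zero, and the function name egg_closed_open is hypothetical. It also assumes a non-empty record and the polarity used in this chapter, in which the EGG decreases as vocal fold contact increases.

```python
def egg_closed_open(egg):
    """Locate opening and closing instants (0-based sample indices) in an EGG record."""
    n = len(egg)
    mean = sum(egg) / n
    x = [v - mean for v in egg]                          # step 1: remove the mean
    d = [0.0] + [x[k] - x[k - 1] for k in range(1, n)]   # step 2: H(z) = 1 - z^-1

    # Step 3: zero crossings of the mean-removed EGG, in time order.
    crossings = []
    for k in range(1, n):
        if x[k - 1] > 0 and x[k] <= 0:
            crossings.append((k, "PN"))                  # positive-to-negative
        elif x[k - 1] <= 0 and x[k] > 0:
            crossings.append((k, "NP"))                  # negative-to-positive

    # Steps 4 and 5: the first crossing only fixes the starting polarity;
    # every later crossing ends one search window.
    open_idx, close_idx = [], []
    start = 0
    for k, kind in crossings[1:]:
        window = range(start, k + 1)
        if kind == "PN":
            # Window holds a rising EGG segment: opening = maximum of Diff EGG.
            start = max(window, key=lambda j: d[j])
            open_idx.append(start)
        else:
            # Window holds a falling EGG segment: closing = minimum of Diff EGG.
            start = min(window, key=lambda j: d[j])
            close_idx.append(start)
    return open_idx, close_idx
```

Because each instant is found from data up to the most recent zero crossing only, the sketch keeps the sequential character noted above.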

The  opening  and  closing  instants  also  need  to  be 
determined  from  the  glottal  area  function  to  compare  the  two 
methods,  EGG  based  and  area  based.  A  similar  algorithm  to 
determine  the  opening  and  closing  instants  from  the  area  has 
also  been  implemented. 

Algorithm AREA-Closed-Open

1. Locate the peaks in the glottal area record. Let
the number of peaks = n.

2. For i = 1, 2, ..., n-1
do

i)   locate the minimum glottal area between the glottal
     area peaks i and i+1.

ii)  let A1 = area of glottal peak i
         A2 = area of glottal peak i+1
         M  = minimum glottal area between peaks
     Set the threshold as

     THRES = 0.1 * ((A1 + A2)/2 - M) + M.

iii) locate the closing instant as the index j such that
     Area(j-1) > THRES and Area(j) < THRES;
     locate the opening instant as the index K such that
     Area(K-1) < THRES and Area(K) > THRES.

3. End effects:

i)   locate a closing/opening instant between the start
     and the first area peak (if possible).

ii)  locate a closing/opening instant between the last
     area peak and the end (if possible).

4. Return.
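
Step 2 of the area-based algorithm can be sketched in Python as follows. This is a hedged illustration only: the peak indices are taken as an input because the text does not say how the peaks are located, samples exactly equal to the threshold are treated as below it, the end effects of step 3 are omitted, and the function name area_closed_open is hypothetical.

```python
def area_closed_open(area, peaks):
    """Return (closing, opening) sample-index pairs, one per pair of adjacent peaks."""
    instants = []
    for p1, p2 in zip(peaks, peaks[1:]):
        # Step 2(i): minimum glottal area between the two peaks.
        m = min(range(p1, p2 + 1), key=lambda j: area[j])
        # Step 2(ii): threshold 10% of the way from the minimum to the mean peak area.
        thres = 0.1 * ((area[p1] + area[p2]) / 2 - area[m]) + area[m]
        # Step 2(iii): downward threshold crossing before the minimum ...
        close = next((j for j in range(p1 + 1, m + 1)
                      if area[j - 1] > thres >= area[j]), None)
        # ... and upward threshold crossing after it.
        opening = next((k for k in range(m + 1, p2 + 1)
                        if area[k - 1] <= thres < area[k]), None)
        if close is not None and opening is not None:
            instants.append((close, opening))
    return instants
```

Setting the threshold slightly above the minimum area, rather than at the minimum itself, keeps small measurement noise near closure from moving the detected instants.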

Results

A  computer  program  has  been  implemented  that 
incorporates  the  above  two  algorithms  and  automatically 
computes  the  errors  in  locating  the  opening  instant,  the 
closing  instant  and  in  computing  the  period  and  O.Q.  from 
the  EGG.  The  values  of  these  variables  obtained  from  the 
glottal  area  are  used  as  the  reference.  The  program  also 
plots  out  the  synchronized  EGG,  Diff  EGG  and  glottal  area 
with  the  relevant  points  marked  to  allow  the  researcher  to 
verify that the algorithms have indeed performed correctly.

This program was run twice for each of the 36 data sets,
once comparing the area with the EGG traced from the film
and once comparing the area with the EGG recorded on the
audio tape. As was explained in Chapter 2, the
synchronization between the traced EGG and the glottal area
is very good; hence this was chosen for further analysis.
In some data sets, however, the EGG trace went off the film
or else the algorithm made obvious errors. In such cases,
the comparison of the glottal area with the EGG recorded on
audio tape was used.

Since  the  closing  and  opening  instants  in  the  area  are 
well  defined  only  when  complete  glottal  closure  exists,  only 
such  tasks  (22  out  of  the  36  in  the  data  base)  have  been 
included  in  the  analysis  of  opening  and  closing  instants 
determination  error. 

These  results  have  been  summarized  in  the  form  of  a 
series  of  tables  and  figures. 

Opening  instant  error.  The  error  in  determining  the 
opening  instant  from  the  EGG  as  compared  with  the  opening 
instant  determined  from  the  glottal  area  for  each  of  the 
four  subjects  in  this  study  is  shown  in  Tables  3.1,  3.3,  3.5 
and  3.7.  The  distribution  of  this  error  is  shown  in  Figure 
3.11. 

Figure  3.11  reveals  that  while  the  error  was  less  than 
eight  samples  (0.8  ms)  in  most  cases,  there  are  two  examples 
where  the  error  is  more  than  twelve  samples  (1.2  ms).  Also, 
the error shows some subject dependency.

Closing instant error. The error in locating the
closing instant for the four subjects is shown in Tables
3.2, 3.4, 3.6 and 3.8. Figure 3.12 shows the distribution
of the error. It is seen that the closing instant was
located with an error of less than six samples (0.6 ms) in
most cases.

Pitch period measurement. The pitch period as measured
from the EGG and the glottal area, and the error are shown in
Tables 3.9-3.12. The error distribution is summarized in
Figure 3.13. It is seen, as might be expected, that the EGG
is an excellent signal for the measurement of pitch,
typically involving less than 0.5% error in the measurement
as compared with the values obtained from the glottal area.

O.Q.  measurement.  The  open  quotients  measured  from  the 
EGG  and  the  glottal  area,  and  the  error  in  the  EGG 
measurement  are  shown  in  Tables  3.13-3.16  and  the  error 
distribution  summarized  in  Figure  3.14.  Note  the  strong 
subject  dependency  of  the  error. 

Discussion.  The  results  obtained  show  that  the  EGG  is 
an  excellent  signal  for  locating  the  closing  instants  of  the 
vocal  fold  vibration  and  for  determining  the  vibration 
period.  While  it  gives  a  good  indication  of  the  region 
where  opening  occurs,  the  present  algorithm  is  not  very 
effective  in  locating  the  exact  instant.  Consequently,  the 
O.Q.  measured  from  the  EGG  also  shows  large  errors. 
Moreover, such errors appear to have a subject dependency.

TABLE 3.1 ERROR IN DETERMINING OPENING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: JMN

INT \ FREQ (Hz)     125      170      340
LOW                 4.00     3.2      N/A
MED                 2.00     14.2     N/A
HIGH                6.25     3.4      N/A

TABLE 3.2 ERROR IN DETERMINING CLOSING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: JMN

INT \ FREQ (Hz)     125      170      340
LOW                 1.00     2.00     N/A
MED                 2.25     3.5      N/A
HIGH                1.67     3.4      N/A

TABLE 3.3 ERROR IN DETERMINING OPENING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: DMK

INT \ FREQ (Hz)     125      170      340
LOW                 4.33     3.80     3.6
MED                 2.00     3.20     2.6
HIGH                3.00     0.600    N/A

TABLE 3.4 ERROR IN DETERMINING CLOSING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: DMK

INT \ FREQ (Hz)     125      170      340
LOW                 1.00     3.48     1.6
MED                 3.50     0.00     2.1
HIGH                1.25     2.20     N/A

TABLE 3.5 ERROR IN DETERMINING OPENING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: AKK

INT \ FREQ (Hz)     125      170      340
LOW                 11.5     7.6      12.25
MED                 11.4     4.83     N/A
HIGH                6.5      N/A      7.3

TABLE 3.6 ERROR IN DETERMINING CLOSING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: AKK

INT \ FREQ (Hz)     125      170      340
LOW                 0.667    0.75     2.40
MED                 1.00     2.86     N/A
HIGH                0.5      N/A      1.9

TABLE 3.7 ERROR IN DETERMINING OPENING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: GPM

INT \ FREQ (Hz)     125      170      340
LOW                 N/A      4.25     4.1
MED                 N/A      N/A      3.1
HIGH                3.49     5.28     1.14

TABLE 3.8 ERROR IN DETERMINING CLOSING INSTANT
IN NUMBER OF SAMPLES

SUBJECT: GPM

INT \ FREQ (Hz)     125      170      340
LOW                 N/A      4.00     2.0
MED                 N/A      N/A      0.9
HIGH                7.56     6.00     3.71

Another phenomenon observed is that for the same task,
the periods and O.Q.s measured from the glottal area for
each vibratory cycle show more variation than those measured
from the EGG. The implications of this have not yet been
pursued.

EGG  and  the  Length  of  Glottal  Contact 


We have mentioned earlier that frame-by-frame
projection of the high speed films of the vocal folds shows
phase differences along the length of the folds during the
opening and closing phases, i.e., during the closing
(opening) phase, contact (opening) between the folds first
occurs over a small portion of their length. In succeeding
frames this contact (opening) proceeds "zipper"-like along
the length of the folds until the whole glottis is closed
(open). This behavior is more pronounced during opening
than closing phases.

Now,  the  lateral  area  of  contact  between  the  vocal 
folds  changes  in  two  dimensions,  along  the  length  of  the 
folds  and  also  along  their  thickness.  However,  we  can 
assume  as  a  first  approximation  that  the  depth  of  contact 
does  not  change  appreciably  during  the  period  of  time  when 
an  initial  glottal  opening  spreads  to  the  entire  length  of 
the folds. Then, the lateral area of contact is
proportional to the length of contact along the top margins
of the vocal folds. A similar remark also applies during
closure.

Since the EGG is presumably proportional to the inverse
of the lateral area of contact, the conjecture is that
during the opening phase the EGG is proportional to the
length of the glottal opening. Smith (15) and Childers,
Smith and Moore (4) found good correlation between the EGG
and the length of the glottal opening.

We  compare  the  EGG  and  the  glottal  opening  length  in 
Figures  3.15-3.19.  These  plots  are  arranged  as  follows: 
The  first  graph  is  of  the  length  of  the  glottal  opening, 
normalized  to  be  between  0  and  1.  The  second  graph  is  the 
EGG.  In  the  third  graph,  the  length  of  the  glottal  opening 
and  the  EGG  are  superposed.  According  to  our  arguments,  the 
EGG  is  proportional  to  the  glottal  opening  length  only 
during  the  open  phase.  Thus  in  the  third  graph,  the  portion 
of  the  EGG  corresponding  to  the  closed  phase  has  not  been 
plotted.  Also,  the  EGG  during  the  open  phase  has  been 
scaled  to  be  between  0  and  1  in  each  period. 
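
The per-period scaling described above amounts to a min-max normalization. A minimal Python sketch follows; the function name and the handling of a constant segment are assumptions, and the segment passed in would be the EGG samples (or the glottal opening lengths) of one open phase.

```python
def scale_unit_range(segment):
    """Scale one open-phase segment linearly so its values lie between 0 and 1."""
    lo, hi = min(segment), max(segment)
    if hi == lo:
        # Assumption: a constant segment maps to all zeros.
        return [0.0 for _ in segment]
    return [(v - lo) / (hi - lo) for v in segment]
```

Scaling each period separately removes cycle-to-cycle amplitude differences, so that only the shapes of the two curves are compared when they are superposed.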

These  figures  show  that  in  most  of  the  examples,  the 
EGG  and  the  glottal  opening  length  correlate  very  well.  The 
rising  portion  of  the  EGG  during  opening,  the  flat  portion 
corresponding  to  no  contact  and  the  steep  closing  portion 
agree  with  the  corresponding  phases  in  the  length.  There  is 
another  important  observation  to  be  made--in  some  of  the 
data  sets  with  a  closed  phase  (Figures  3.17,  3.18  and  3.19), 
it  is  seen  that  the  EGG  has  a  smaller  value  at  the  instant 
of  glottal  opening  than  at  the  instant  of  glottal  closure. 
In other words, for the same length of contact between the
folds, the impedance across the folds is smaller during
opening than during closure. If we assume that the
electrical properties of the contacting surfaces are the same
during the opening and closing phases, this implies that the
thickness of the contacting region is much larger during the
opening than the closing phase. However, this is in direct
contradiction to what has been observed in practice. To
quote Baer,

Glottal closure also exhibited wavelike properties.
Tissues at the lower edge of closure were peeled apart,
while tissues above the point of closure were still coming
together. The depth of closure was almost negligible
immediately before the glottis opened. (26, page 40)

Even in our own observations of the high speed films, it is
seen that just before opening, the texture and reflectance
of the contacting surface change in a way that leads one to
believe that the depth of closure is very small.

Thus,  if  the  depth  of  closure  is  in  fact  smaller  at  the 
opening  instant,  then  the  lower  impedance  must  be  due  to  a 
higher  conductivity  of  the  contacting  surfaces. 

Observations  of  the  films  show  that  in  many  instances, 
the  last  layer  of  the  vocal  folds  to  separate  during  the 
opening  phase  is  the  free  mucus  on  the  surface  of  the 
folds.  After  repeated  observations  of  some  of  the  films  in 
the  data  base,  we  are  convinced  that  the  mucus  is  indeed 
responsible  for  the  lower  impedance  during  the  opening 
phase.   This  point  is  taken  up  further  in  the  next  section. 

Figure 3.15  Synchronized length of glottal opening and EGG,
Subj: DMK. (a) 125 Hz, 77 dB; (b) 178 Hz, 74 dB.

Figure 3.16  Synchronized length of glottal opening and EGG,
Subjs: DMK and JMN. (a) DMK, 340 Hz, 63 dB.

Figure 3.17  Synchronized length of glottal opening and EGG,
Subj: JMN. (a) 176 Hz, 75 dB; (b) 170 Hz, 72 dB.

Figure 3.18  Synchronized length of glottal opening and EGG,
Subj: GPM. (a) 176 Hz, 64 dB; (b) 348 Hz, 73 dB.

Figure 3.19  Synchronized length of glottal opening and EGG,
Subj: AKK. (a) 196 Hz, 72 dB; (b) 17Q Hz, 63 dB.

EGG and Observations of the High Speed Films


The  final  part  of  the  study  correlating  the  EGG  and  the 
ultra-high  speed  films  was  the  frame-by-frame  visual 
observation  of  the  films  to  locate  events  which  may  be 
responsible  for  the  shape  of  the  EGG  waveform. 

Two complete vibratory cycles were selected for each
task. Then, from the plot of the synchronized EGG trace,
we determined the film frames corresponding to the opening
instant, to the knees around the flat top of the EGG, and to
the closing instant. Figure 3.20 illustrates these events.
The high speed film was projected onto a screen using a
stop-frame projector, and the vibratory behavior observed
during these frames was noted down. This results in a table
of observations of the form shown in Table 3.17. The
observations from several such tables were collected to form
Tables 3.18-3.20.

Reading these three tables along the rows corresponding
to the EGG opening and the EGG knee, it is seen that

i)  In 6 of the 12 tasks, the EGG knee before the flat
open phase coincided with a break in a strand of free mucus
stretching between the folds.

ii)  In 6 of the 12 tasks, the maximum in the Diff EGG
coincided with a frame in which there is some form of change
in a mucus bridge across the folds.

The effect of the mucus on the EGG is particularly
apparent in one of the tasks, Subj: JMN, 170 Hz, 72 dB. The

Figure 3.20  EGG events chosen for film observation. The
figure shows the EGG and Diff EGG traces with four events
marked for detailed observation on the high speed film:
A, EGG knee; B, EGG knee; C, EGG closing instant;
D, EGG opening instant.

TABLE 3.17  TABLE OF OBSERVATIONS FOR
SUBJ: JMN, TASK: 170 Hz, 72 dB

EGG FEATURE       LENGTH/AREA FEATURE    FILM FRAMES DESCRIPTION

1. Opening,       1. Opening,            1. Small posterior opening has
   frame 12.5        frame 5                started as early as frame 2.
                                            Between frames 12 and 13 a large
                                            mucus bridge at the vocal
                                            process begins to separate.

2. Knee,          2. Knee in length,     2. The mucus strand at the vocal
   frame 15.5        frame 15               process breaks in frames 15-16.

3. Knee,          3. Knee in length,     3. First lateral contact between
   frame 27.5        frame 27               the folds occurs in frames 27-28.

4. Closure,       4. Closure,            4. Complete glottal closure by
   frame 29          frame 28               frame 28.

5. Opening,       5. Opening,            5. Posterior opening present from
   frame 43.5        frame 35               30. Change in mucus bridge
                                            frames 43-44 as described in 1.

6. Knee,          6. Knee,               6. Mucus strand breaks, frame 46.
   frame 46.5        frame 44

7. Knee,          7. Knee,               7. First contact between folds is
   frame 58          frame 58               in frame 58.

8. Closure,       8. Closure,            8. Closure occurs in frame 59.
   frame 59          frame 59

9. Opening,       9. Opening,            9. Opening has started frame 63,
   frame 73.5        frame 67               same comments as in 1.

table of observations for this task is shown in Table
3.17.  The  length  of  the  glottal  opening  and  the  EGG  for 
this  task  are  shown  in  Figure  3.17(b).  An  examination  of 
this  table  and  the  corresponding  high  speed  film  is  a 
convincing  demonstration  of  this  fact. 

A Qualitative Model for the EGG


The  last  three  sections  are  the  report  of  an 
experimental  study  comparing  the  EGG  and  vocal  fold 
vibration  as  deduced  from  high  speed  films.  Now,  our 
understanding  of  both  vocal  fold  vibration  and  the  EGG  is 
insufficient  to  completely  describe  the  various  types  of  EGG 
observed,  or  even  all  the  features  in  a  given  EGG  record. 
Nevertheless,  the  results  presented  in  this  chapter  allow  us 
to  describe  an  "ideal"  EGG  signal  that  incorporates  all  the 
features  that  have  been  consistently  observed.  This  is 
indeed  the  purpose  of  the  Rothenberg  model  of  Figure  3.5. 

The  model  we  present  below  is  a  refinement  of  the 
Rothenberg  model.  A  schematic  representation  of  our  model 
is  shown  in  Figure  3.21.  The  discussion  to  follow  is  with 
reference  to  this  figure. 

In  Figure  3.21,  A-B  is  the  period  of  time  in  the 
glottal  opening  phase  when  the  vocal  folds  are  moving  apart 
increasing  the  glottal  area.  There  is  no  contact  between  the 
folds  and  the  EGG  is  consequently  a  constant. 

During  B-C,  the  folds  are  coming  together  decreasing 
the  glottal  area,  but  first  contact  between  the  folds  occurs 
only  at  C.   The  EGG  is  thus  constant  during  B-C. 
The interval C-D corresponds to the rapid closure of
the  folds  along  their  length  and  at  time  D,  the  projected 
glottal  area  becomes  zero.  The  EGG  decreases  rapidly  during 
this  time,  and  the  large  negative  spike  in  the  Diff  EGG 
occurs  very  close  to  time  D. 

The  interval  D-F  is  the  glottal  closed  phase  with  zero 
glottal  area.  The  EGG  decreases  during  the  initial  portion, 
D-E,  of  the  closed  phase  reflecting  the  increasing  depth  of 
contact  between  the  folds.  At  E  the  folds  reach  maximal 
lateral  contact  and  subsequently  begin  pulling  apart  at  the 
lower margins. This causes the EGG to increase between E and
F. The Diff EGG also increases during this time; thus the
EGG increases with an increasing slope, i.e., it is concave
upwards.

The  point  F  corresponds  to  the  first  appearance  of  the 
glottal  opening  on  the  upper  margins  of  the  folds.  Usually, 
but  not  always,  this  coincides  with  a  discontinuity  in  the 
slope  of  the  EGG.  Between  F  and  G  the  glottal  opening 
spreads  along  the  length  of  the  folds,  decreasing  the  amount 
of lateral contact. The EGG consequently increases
monotonically; however, the Diff EGG is now decreasing and
so the EGG is concave downwards from F. The EGG has an
inflection point at F.

After  time  G,  the  folds  are  no  longer  in  contact  and 
the  EGG  remains  constant.   The  cycle  then  repeats  itself. 
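
The phases described above can be turned into a rough numerical sketch of one idealized EGG period. The text fixes only the qualitative behavior in each interval; the breakpoint times, the signal levels and the use of straight lines and parabolas in the code below are arbitrary assumptions chosen for illustration, and the names `T`, `OPEN` and `ideal_egg` are hypothetical.

```python
import numpy as np

# Hypothetical event times, as fractions of one period (A is at 0).
T = {"C": 0.45, "D": 0.50, "E": 0.60, "F": 0.80, "G": 0.95}
# Hypothetical signal levels: no contact, and the values at D, E and F.
OPEN, AT_D, AT_E, AT_F = 1.0, 0.3, 0.0, 0.4

def _segment(t, start, stop):
    """Mask of samples in [start, stop) and their position 0..1 within it."""
    mask = (t >= start) & (t < stop)
    return mask, (t[mask] - start) / (stop - start)

def ideal_egg(t):
    """Idealized EGG at phase t (in periods); repeats with period 1."""
    t = np.atleast_1d(np.asarray(t, dtype=float)) % 1.0
    y = np.full(t.shape, OPEN)            # A-C and after G: no contact
    m, x = _segment(t, T["C"], T["D"])    # C-D: rapid closure
    y[m] = OPEN + (AT_D - OPEN) * x
    m, x = _segment(t, T["D"], T["E"])    # D-E: depth of contact increases
    y[m] = AT_D + (AT_E - AT_D) * x
    m, x = _segment(t, T["E"], T["F"])    # E-F: increasing, concave upwards
    y[m] = AT_E + (AT_F - AT_E) * x**2
    m, x = _segment(t, T["F"], T["G"])    # F-G: increasing, concave downwards
    y[m] = AT_F + (OPEN - AT_F) * (2 * x - x**2)
    return y

t = np.linspace(0.0, 1.0, 2000, endpoint=False)
egg = ideal_egg(t)
```

With these choices the slope is discontinuous at F, which the model says is usual but not universal; a smoother join at F would be an equally valid idealization.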

Figure 3.21  A qualitative model for the EGG

Conclusions


This chapter compared the electroglottograph signal
with  simultaneously  obtained  ultra-high  speed  films  of  the 
vocal  folds.  The  comparisons  indicate  that  the  EGG  is 
indicative  of  lateral  glottal  contact.  The  experiments  of 
Smith  (42,43)  that  allegedly  show  that  the  EGG  registers 
acoustic  and  mechanical  effects  appear  incorrect. 

The  behavior  of  the  EGG  during  the  different  glottal 
phases  was  described.  An  algorithm  for  determining  the 
instants  of  glottal  opening  and  closure  from  the  Diff  EGG 
was  described  and  evaluated  by  comparing  against  the  glottal 
area.  The  O.Q.  and  period  computed  from  the  EGG  were  also 
compared  against  the  values  determined  from  the  glottal 
area.  The  results  indicate  that  the  EGG  provides  an 
accurate  determination  of  the  closing  instant  and  the  voice 
period.  The  determination  of  the  instant  of  glottal  opening 
is  not  as  reliable,  but  is  typically  within  0.8  ms  of  the 
corresponding  instant  determined  from  the  glottal  area. 

It  was  pointed  out  that  the  EGG  is  affected  by  mucus 
strands  bridging  the  folds.  These  appear  to  provide  a 
highly  conductive  path  for  the  radio  frequency  signal  used 
in  the  EGG. 

Finally,  a  qualitative  model  for  the  EGG  was  presented. 


CHAPTER  4 

SYNCHRONIZED  GLOTTAL  VOLUME  VELOCITY, 
GLOTTAL  AREA  AND  THE  EGG 


Introduction


The  vibration  of  the  vocal  folds  and  the  relationship 
between  this  vibration  and  the  EGG  was  studied  in  the  last 
chapter.  Here  the  concern  is  with  correlating  the  EGG  and 
the  acoustic  consequence  of  vocal  fold  vibration,  namely  the 
glottal  sound  source. 

The  periodic  vibrations  of  the  vocal  folds  cause 
"puffs"  of  air  to  flow  into  the  supraglottal  cavities.  This 
airflow,  or  glottal  volume  velocity,  is  then  shaped  by  the 
acoustic  vocal  tract  filter  and  radiated  as  sound  at  the 
lips.  The  waveform  of  the  glottal  volume  velocity  (v-v)  is 
an  important  variable  in  determining  the  properties  of  the 
radiated  speech  wave  and  is  therefore  fundamental  to  all 
investigations  of  the  speech  production  process. 

While it was proposed as early as the 1830's that the
vocal folds act as a harmonic generator, the exact waveshape
of the v-v was obtained only in the 1950's, when the vocal
tract was finally understood and analysed as an acoustical
system (44). The reason for this is that the glottal v-v is
not easily transduced, but rather has to be somehow inferred
from the output at the mouth. This entails some method of
"cancelling out" or inverse filtering the effects of the
vocal tract from the mouth output. Since the vocal tract
filter  is  also  unknown,  this  too  has  to  be  estimated  from 
the  speech  signal.  One  possible  way  is  to  assume  a 
parametric  model  for  the  vocal  tract,  estimate  the 
parameters  using  the  speech  wave,  and  then  use  this  derived 
model  to  inverse  filter  the  speech.  Details  of  this 
technique,  and  a  number  of  related  techniques  are  presented 
in  the  next  two  sections. 

There  are  several  reasons  for  studying  synchronized  EGG 
and  glottal  v-v,  and  it  is  appropriate  to  introduce  them  at 
this  point.  Firstly,  a  common  problem  in  all  the  currently 
used  methods  of  inverse  filtering  is  deciding  when  the 
method  has  performed  correctly,  i.e.,  deciding  when  the 
inverse  filter  and  the  estimated  glottal  v-v  are  indeed  the 
true  ones.  Typically,  this  decision  is  based  on  the 
presence  or  lack  of  certain  "expected"  features  in  the 
resulting v-v waveform. The EGG, being an independently
obtained signal, can provide an objective basis for making
this decision if one knows how features in the EGG and the
v-v are related (37). Carrying this argument further, it
may be possible to use the EGG in automating the
inverse filtering itself (37).

The second motivation, related to the first, has to do
with the difficulty of inverse filtering. The glottal v-v
has a significant influence on the quality of the voice
produced (45,46). Holmes (7) and Yea (47) have shown that
using a glottal excitation close to the true glottal v-v
can greatly improve the quality of the speech produced by
speech synthesizers. Thus, in situations such as a clinical
environment, voice or singing training, vocoding, etc.,
information about the glottal sound source is desirable, but
is precluded because of the difficulty or inappropriateness
of inverse filtering. The question then arises: Can the
EGG supply any of the information desired?

Finally, one of the long-term goals of the research in
laryngeal and voice source dynamics is that of deducing the
motions of the vocal folds from a set of simultaneously
transduced glottographic waveforms such as the glottal area,
the EGG and the glottal v-v (48). Establishing experimental
correlations between synchronized glottographic waveforms is
a first step in this project.

An  additional  remark:  The  last  few  years  have  seen  a 
great  improvement  in  our  understanding  of  the  glottal  v-v 
and  its  dependence  on  the  glottal  area  and  the  supraglottal 
tract  (49,50,51).  However,  there  have  been  very  few 
systematic  studies  in  which  the  glottal  area  and  the  glottal 
volume  velocity  have  been  obtained  in  synchrony.  Thus,  even 
without  the  EGG,  the  present  data  base  should  prove  useful 
in  the  testing  and  verification  of  these  theories. 

The  Linear  Model  for  Voiced  Speech 


Almost  all  techniques  for  inverse  filtering  the  speech 
signal  to  obtain  the  glottal  v-v  are  based  on  the  linear 
model  shown  in  Figure  4.1.   The  source  is  assumed  to  be  a 
periodic waveform generator which outputs pulses of v-v.
The  v-v  is  input  to  a  linear,  time  invariant  vocal  tract 
filter.  The  transfer  function  of  the  vocal  tract  filter  is 
determined  by  the  supraglottal  articulators.  The  output  of 
this  filter  is  then  passed  through  a  second  filter  that 
models  the  radiation  at  the  lips,  and  is  finally  output  as 
speech.  While  conceptually  and  computationally  simple,  this 
model  is  not  strictly  correct,  because  the  assumption  that 
the  source  and  the  tract  are  linearly  separable,  i.e.,  they 
do  not  influence  one  another,  is  incorrect.  As  a  matter  of 
fact,  the  glottal  v-v  is  affected  by  the  vocal  tract 
transfer  function,  and  the  above  linear  model  needs  a 
careful  interpretation.  Since  this  interpretation  is 
essential  to  understanding  the  limitations  of  inverse 
filtering schemes, we now discuss the steps leading to the
linear model.

The physiological system producing voiced speech
consists of two interacting subsystems: the mechanical
vibrations of the vocal folds and the sub- and supra-glottal
acoustic filters. The Ishizaka-Flanagan model (52) or the
model  of  Titze  (27)  leads  to  a  set  of  coupled  differential 
equations  describing  the  total  system.  The  complexity  and 
computational  requirements  of  these  models  are  substantial, 
and  they  do  not  lead  to  practical  schemes  for  inverse 
filtering  (see,  however,  Note  1).  Now,  even  though  the  two 
systems  are  coupled,  extensive  simulations  with  these  models 
(as  well  as  observations  of  vocal  fold  vibration)  do  not 
show any significant influence of the vocal tract on the
vibration  or  the  glottal  area  function.  What  does  seem  to 
be  affected  by  this  coupling  is  the  glottal  v-v. 

Many  of  the  current  studies  in  the  glottal  sound  source 
are  concerned  with  making  a  simpler  analysis  of  the  source- 
tract  coupling  effects  than  afforded  by  the  Ishizaka- 
Flanagan  model  (49,50,51).  Here,  the  glottal  area  function 
is  treated  as  given,  and  a  lumped  parameter  electrical 
equivalent  circuit  is  used  for  the  vocal  tract.  This  model 
is  shown  in  Figure  4.4.  For  simplicity,  only  a  single 
formant  vocal  tract  is  represented.  The  time-varying 
resistance,  Rg(t),  and  inductance,  Lg(t),  are  the  glottal 
resistance  and  inductance  respectively,  and  are  controlled 
by  the  assumed  area  function  as  well  as  the  current  flowing 
through  them.  If  the  input  impedance,  Zt,  of  the  vocal 
tract  as  seen  by  the  glottis,  is  very  small  compared  to 
Rg(t)  for  all  t,  the  current  flow  Ug(t)  would  be  determined 
mostly  by  the  glottal  impedance,  and  there  would  be 
negligible  source-tract  coupling.  This  assumption  is  true 
only  when  the  glottal  area  Ag(t)  is  zero  or  very  small; 
during  the  glottal  open  phase,  Rg(t)  and  Zt  are  comparable 
and  this  loading  effect  influences  Ug(t)  as  follows: 

1)  The  inertive  nature  of  Zt  at  frequencies  below  that 
of  the  first  formant  causes  a  delay  in  the  peak  of  Ug(t)  as 
compared  to  Ag(t).  This  results  in  a  steeper  slope  at 
closure  (and  consequently  more  high  frequency  energy)  in 
Ug(t)  as  compared  to  Ag(t)  (50). 

2)  The finite values of Rg(t) during the glottal open
phase  cause  an  increase  in  the  effective  bandwidth  of  the 
resonant  frequencies;  since  Rg(t)  is  time-varying,  the 
frequency  of  resonance  also  changes.  This  modulation  effect 
on  the  frequencies  and  bandwidths  of  the  resonances  of  the 
system  is  the  critical  one  in  considering  inverse  filtering 
(6). 

Now,  in  spite  of  this  coupling  effect,  if  we  define 
everything  to  the  left  of  the  dashed  line  in  Figure  4.4  as  a 
source  with  its  output  being  the  actual  volume  velocity 
(current) flow Ug(t) for a given vocal tract configuration,
then  in  this  context,  the  linear  time  invariant  model  of 
Figure  4.1  is  valid.  The  decoupling  of  the  glottal  and 
supraglottal  systems  is  achieved  by  including  in  the  source 
all  the  effects  of  source-tract  coupling.  Note,  however, 
that  now  the  defined  source  is  not  independent  of  the  tract. 

Having  established  the  framework  in  which  the  model  of 
Figure  4.1  is  valid,  we  return  to  some  general  discussions 
on  glottal  inverse  filtering  based  on  this  model.  First, 
since  all  the  blocks  are  linear,  we  can  interchange  the 
order  of  vocal  tract  filtering  and  radiation  leading  to 
Figure  4.2.  It  is  well  known  that  the  radiation  term  can  be 
approximated  very  well  by  a  differentiation  (2).  Combining 
the  first  two  blocks  of  Figure  4.2,  we  arrive  at  Figure  4.3 
where  the  source  is  now  the  differentiated  glottal  v-v. 

Figure 4.1  Linear model for voiced speech

Figure 4.2  Model of Figure 4.1 with vocal tract filter
and radiation interchanged.

Figure 4.3  Model of Figure 4.2 with source and radiation
combined.

Figure 4.4  A simple model to study the source-tract
interaction effects

The vocal tract filter for vowel sounds is usually
modeled as an all-pole filter; this can be theoretically
justified on the basis of acoustic tube modeling of the
vocal tract (2). The inverse of the vocal tract filter
therefore contains only zeros or antiresonances. If the
radiated speech wave is passed through this inverse vocal
tract filter, the output will be the differentiated glottal
v-v. A simple integration of this signal will yield the
glottal v-v.
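
The chain just described (differentiated source, all-pole vocal tract, all-zero inverse filter, integration) can be checked numerically on synthetic data. The sketch below is an illustration under stated assumptions, not part of the analysis system used in this study: the glottal pulse is a toy half-sine, the vocal tract has a single resonance with arbitrarily chosen parameters, the inverse filter is given the true coefficients rather than estimates, and the function names are hypothetical.

```python
import numpy as np

def all_pole_filter(x, a):
    """All-pole vocal tract model: y(n) = x(n) - sum_i a[i-1] * y(n-i)."""
    x = np.asarray(x, dtype=float)
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y[n] = acc
    return y

def inverse_filter(s, a):
    """All-zero inverse filter: q(n) = s(n) + sum_i a[i-1] * s(n-i)."""
    s = np.asarray(s, dtype=float)
    q = s.copy()
    for i, ai in enumerate(a, start=1):
        q[i:] += ai * s[:-i]
    return q

# Toy glottal volume velocity: half-sine open phase, zero closed phase.
period, n_open = 80, 50
pulse = np.concatenate([np.sin(np.pi * np.arange(n_open) / n_open),
                        np.zeros(period - n_open)])
ug = np.tile(pulse, 3)

# One resonance at 500 Hz for an assumed 8 kHz sampling rate.
r, theta = 0.95, 2 * np.pi * 500 / 8000
a = [-2 * r * np.cos(theta), r * r]

dug = np.diff(ug, prepend=0.0)          # radiation modeled as a first difference
s = all_pole_filter(dug, a)             # synthetic "speech"
ug_est = np.cumsum(inverse_filter(s, a))  # inverse filter, then integrate
```

When the inverse filter matches the vocal tract exactly, `ug_est` reproduces `ug` to within rounding error. In practice the coefficients must be estimated from the speech itself, which is the subject of the remainder of this chapter.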

The  various  inverse  filtering  schemes  described  in  the 
next  section  differ  essentially  in  the  methods  of  estimating 
or implementing the inverse vocal tract filter. Now the
problem of estimating the resonances (frequencies and
bandwidths) of the vocal tract filter is a common one in
speech  analysis.  Techniques  such  as  short-time  Fourier 
analysis,  linear  prediction  and  homomorphic  processing  have 
been  applied  to  this  problem  (53,54).  Note  however  that  any 
analysis  scheme  that  is  applied  over  several  pitch  periods 
(or  even  an  entire  period),  and  assumes  a  time-invariant 
vocal  tract  filter  over  the  analysis  duration  will  lead  to 
erroneous  results  because  of  the  source-tract  coupling 
effects  mentioned  earlier.  Only  during  the  glottal  closed 
phase  are  the  vocal  tract  characteristics  stationary.  To 
estimate  the  vocal  tract  filter  of  Figure  4.1,  analysis 
should  be  restricted  to  this  interval.  This  is  what  has  led 
to  the  concept  of  closed-phase  speech  analysis  (6,19,55). 


Some Inverse Filtering Techniques


We  present  in  this  section  some  of  the  wide  variety  of 
inverse  filtering  techniques  possible.  The  primary 
motivation  is  to  show  that  while  many  different 
implementations  are  possible  they  all  lack  an  objective 
criterion  in  deciding  when  the  signal  output  is  the  true 
glottal  volume  velocity. 

Miller.  One  of  the  first  investigators  to  successfully 
obtain  the  glottal  volume  velocity  was  R.L.  Miller  in  1959 
(44).  Miller  used  a  linear  phase  low  pass  filter  to  remove 
the  second  and  higher  formants  and  an  analog  zero  circuit  to 
cancel  out  the  first  formant  of  the  speech  wave.  He 
initially used a spectrographic analysis of the speech
signal  to  obtain  the  settings  for  the  inverse  filter 
network,  but  later  appears  to  have  abandoned  this  step, 
setting  the  controls  directly.  The  criterion  used  in 
deciding  the  correct  setting  was  that  the  resulting  glottal 
v-v  should  have  a  "flat"  closed  phase. 

Holmes. J.N. Holmes improved Miller's inverse
filtering technique by including antiresonances or zeros
for five formants (56). The inverse filter controls were
adjusted to produce minimum formant ripple in the output
v-v. Holmes was primarily interested in the improvement of
the  naturalness  of  speech  synthesizer  outputs  when  such 
measured  v-v  waveforms  were  used  as  the  excitation  source 
(7). 

Nakatsui and Suzuki. M. Nakatsui and J. Suzuki
implemented the methods of Miller and Holmes in the
discrete-time domain (57). Their digital inverse filter had
adjustable zeros for the first three formants and fixed
zeros for the fourth and fifth formants. The bandwidths were
computed  using  a  fixed  formula  for  all  the  formants.  Again, 
a  flat  closed  phase  was  used  as  the  criterion  in  adjusting 
the  filter  coefficients. 

Mathews,  Miller  and  David.  This  method  computes  the 
glottal  v-v  using  a  pitch-synchronous  analysis  technique 
(58).  The  method  consists  of  computing  the  Fourier 
coefficients  of  a  pitch  period  of  the  speech  pressure 
signal,  locating  the  formant  frequencies,  removing  the 
formant  poles  from  the  spectrum,  and  regenerating  the 
glottal  waveform  from  the  residual  by  Fourier  synthesis. 
Note  that  since  the  analysis  is  done  over  an  entire  pitch 
period,  the  estimated  formant  frequencies  and  bandwidths 
will  be  incorrect. 

Sondhi. M.M. Sondhi proposed a method of inverse
filtering in which the speaker inserts one end of a
hard-walled acoustic tube into his or her mouth while phonating
(59).  If  the  tube  is  properly  matched  and  has  a 
reflectionless  termination,  Sondhi  showed  that  the  pressure 
picked  up  anywhere  in  the  tube  should  be  a  delayed  version 
of  the  glottal  v-v.  The  method  does  not  work  with  a 
recorded  phonation  and  cannot  be  used  simultaneously  with 
filming  of  the  vocal  folds. 

Rothenberg. M.R. Rothenberg used an analog inverse
filtering scheme similar to those of Miller and Holmes. The
novel feature of Rothenberg's technique is the use of a
circumferentially vented pneumotachograph mask to directly
measure the air volume velocity at the mouth (60). In such
a  case,  the  radiation  filter  of  Figure  4.1  is  not  present, 
and  response  down  to  zero  frequency  can  be  obtained.  Also, 
using  suitable  calibration,  absolute  airflow  levels  can  be 
measured.  Since  no  integration  of  the  inverse  filter  output 
is  required  to  obtain  the  glottal  v-v,  the  method  is  less 
sensitive  to  low-frequency  noise  than  schemes  that  use  the 
radiated  pressure  wave.  The  primary  disadvantage  is  that 
the mask has a frequency response only up to 1.5 kHz.

Later,  Rothenberg  and  Zahorian  (61)  reported  a 
nonlinear filtering scheme in which, by using suitable
feedback, the inverse filter antiresonance frequency and
bandwidth  are  changed  synchronously  with  the  glottal  flow  to 
simulate  the  effects  of  the  frequency  and  bandwidth  changes 
during  the  open  phase.  Under  these  conditions,  the  inverse 
filter  output  should  be  proportional  to  the  glottal  area  of 
Figure  4.2.  Fant  (62),  however,  states  that  the  method  may 
not  be  correct.   In  any  case,  it  is  difficult  to  instrument. 

Berouti. M. Berouti, in 1976, proposed a method for
accurately  estimating  the  vocal  tract  formant  frequencies 
and  bandwidths  from  the  speech  signal  by  analysis  over  the 
closed  glottis  interval  (19).  His  approach  is  based  on  the 
discrete  time  linear  prediction  technique  that  has  proved 
very successful in speech analysis. Berouti identifies the
closed glottal interval by visual inspection of the speech
signal. Since his method is a special case of a more
general approach to be described next, we do not discuss it
further.

Wong, Markel and Gray. In 1979 D. Wong, J. Markel and
A.  Gray  published  their  inverse  filtering  technique  based  on 
a  linear  prediction  model  for  the  speech  signal  (63).  By  a 
careful  analysis  of  the  sequence  of  events  in  a  single 
glottal  period,  they  were  able  to  propose  a  criterion  for 
locating  the  interval  of  glottal  closure. 

Since  their  method 

1)  was  the  only  one  available  that  had  an  objective 
procedure  for  selecting  the  inverse  filter  and 

2)  could  be  implemented  in  software  without  extensive 
new  instrumentation, 

it  was  decided  to  use  this  method  to  carry  out  the 
inverse  filtering  tasks  required  for  this  study.  The 
theoretical  and  practical  implementation  considerations  of 
the  method  are  presented  in  the  next  section. 


The Closed Phase Covariance
Method of Inverse Filtering


Theory 

The inverse filtering method of Wong, Markel and Gray
is based on a discrete time formulation of the linear speech
production model, shown in Figure 4.5(a).
The various z-transforms and their time sequences are
defined as follows:

$U_g(z) \leftrightarrow u_g(n)$, the glottal volume velocity signal

$U_l(z) \leftrightarrow u_l(n)$, the lip volume velocity signal

$S(z) \leftrightarrow s(n)$, the radiated speech signal

$V(z)$, the z-transform of the vocal tract filter

$R(z)$, the z-transform of the radiation filter.

Since all systems are linear, the radiation and vocal
tract filters can be interchanged and the radiation combined
with the source, leading to the arrangement of Figure
4.5(b). The merging of the source and radiation terms leads
to the definition of a new effective source, q(n):

$$q(n) \leftrightarrow Q(z) = U_g(z) R(z)$$

The radiation term is a differentiator and has the
z-transform

$$R(z) = 1 - z^{-1}$$

so

$$Q(z) = U_g(z)\,(1 - z^{-1})$$

which leads to

$$q(n) = u_g(n) - u_g(n-1)$$

on taking the inverse z-transform.

The vocal tract filter V(z) is modeled as an all-pole
filter with p poles.


Figure 4.5(a)  Linear, discrete time model for voiced speech

Figure 4.5(b)  Modified model from Figure 4.5(a): the new
effective source Q(z), with output q(n), drives the vocal
tract filter V(z) to produce s(n).

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V(z)  = 


1  +  Z 
i  =  l 


ai  z 


From Figure 4.5(b), we have

$$S(z) = V(z) Q(z) = \frac{Q(z)}{1 + \sum_{i=1}^{p} a_i z^{-i}}$$

so

$$S(z) \left( 1 + \sum_{i=1}^{p} a_i z^{-i} \right) = Q(z)$$

and taking the inverse z-transform leads to

$$s(n) + \sum_{i=1}^{p} a_i s(n-i) = q(n) \qquad (1)$$
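
Equation (1) can be read as a recursion that generates the speech samples from the effective source. The following Python sketch illustrates this under stated assumptions: the function names `effective_source` and `synthesize`, the triangular pulse and the one-pole filter coefficient are hypothetical illustrations, not the software or data used in this study, and all samples before n = 0 are taken to be zero.

```python
def effective_source(ug):
    """q(n) = ug(n) - ug(n-1), taking ug(-1) = 0 (radiation differentiator)."""
    return [ug[n] - (ug[n - 1] if n > 0 else 0.0) for n in range(len(ug))]


def synthesize(q, a):
    """Equation (1) as a recursion: s(n) = q(n) - sum_i a[i-1] * s(n-i).

    Samples before n = 0 are assumed to be zero.
    """
    s = []
    for n in range(len(q)):
        acc = q[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc -= a[i - 1] * s[n - i]
        s.append(acc)
    return s


# Illustrative values only: a triangular glottal pulse followed by a closed
# phase, and a hypothetical one-pole "vocal tract" (p = 1, a_1 = -0.5).
ug = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
a = [-0.5]
q = effective_source(ug)
s = synthesize(q, a)
print("q(n):", q)
print("s(n):", s)
```

Once the pulse has ended, q(n) is zero and the output obeys the homogeneous form of equation (1), which is the property the closed phase analysis relies on.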


Let us assume that a stable closed glottal condition
exists between sampling instants $n = L_c$ and $n = L_o - 1$. Then
$u_g(n) = 0$ between $n = L_c$ and $n = L_o - 1$ and consequently
$q(n) = u_g(n) - u_g(n-1)$ is zero between $n = L_c + 1$ and
$n = L_o - 1$. This is shown in Figure 4.6.

Then between $n = L_c + 1$ and $n = L_o - 1$ equation (1)
reduces to

$$s(n) + \sum_{i=1}^{p} a_i s(n-i) = 0$$

or

$$\sum_{i=1}^{p} a_i s(n-i) = -s(n) \qquad (2)$$

Figure 4.6 Illustration of the glottal volume velocity, the
differentiated glottal v-v and different analysis
windows in the closed phase covariance method.
[Labels recoverable from the figure: $n = L_c + 1$ and the window $W_M(L_o - M)$.]

Now s(i) is the observed (known) speech output and the
$a_i$'s are the unknown vocal tract parameters. The set of
equations represented by (2) can therefore be interpreted as
$L_o - L_c - 1$ linear equations in the p unknowns
$a_i$, $i = 1, \ldots, p$. From now on we assume that $L_o - L_c - 1 \ge p$,
i.e., the duration of the closed phase ($= L_o - L_c$) is at
least p + 1 samples.

Define the analysis window, $W_M(l)$, at position $l$ and of
length $M + p$, as the set of speech samples $s(l-p), \ldots,
s(l-1), s(l), \ldots, s(l+M-1)$. Some typical analysis windows are
shown in Figure 4.6. For any general $l$, equation (1) is
valid for the speech samples $s(l), \ldots, s(l+M-1)$ in the
analysis window, i.e.,

$$s(k) + \sum_{i=1}^{p} a_i s(k-i) = q(k), \qquad k = l, \ldots, l + M - 1$$

However, for those analysis windows such that
$M \le L_o - L_c - 1$ and $l$ is in the range $L_c + 1 \le l \le L_o - M$,
equation (2) is valid:

$$s(k) + \sum_{i=1}^{p} a_i s(k-i) = 0, \qquad k = l, \ldots, l + M - 1; \quad l = L_c + 1, \ldots, L_o - M \qquad (3)$$

The two extreme windows for which equation (3) is
valid, namely $W_M(L_c + 1)$ and $W_M(L_o - M)$, are also shown in
Figure 4.6.

Berouti's method consists of choosing M = p. Then for
any $l$ in the range $L_c + 1 \le l \le L_o - M$, equations (3)
represent p linear equations in the p unknowns $a_i$, $i = 1, \ldots, p$,
and so can be solved for the $a_i$'s. Once the $a_i$'s are known,
the vocal tract filter is determined.

Wong, Markel and Gray's generalization of Berouti's
method is based on the observation that for typical male
voices, $L_o - L_c - 1$ is much larger than the value of p
needed to model the vocal tract filter. The study by Gish
(6) discusses the interaction between p, the duration of the
closed phase and the voice type in more detail. For us, it
is sufficient to observe that for M in the range
$p < M \le L_o - L_c - 1$, the set of equations (3) represents an
overdetermined system of linear equations in the p unknowns
$a_i$, $i = 1, \ldots, p$. Wong et al. proposed determining the $a_i$'s as
the least squares solution to this system of equations.
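
The least squares solution of equations (3) can be sketched in a few lines of Python. This is a minimal illustration, not the program used in this study: the function names `solve_linear` and `closed_phase_covariance`, the two-pole test filter and the window position are all assumptions made for the example. The normal equations are formed from the samples in the window $W_M(l)$ and solved by Gaussian elimination.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def closed_phase_covariance(s, l, M, p):
    """Least squares solution of equations (3) for the window W_M(l).

    Uses samples s[l-p] ... s[l+M-1]; requires l >= p.
    """
    ks = range(l, l + M)
    phi = [[sum(s[k - i] * s[k - j] for k in ks) for j in range(1, p + 1)]
           for i in range(1, p + 1)]
    rhs = [-sum(s[k - i] * s[k] for k in ks) for i in range(1, p + 1)]
    return solve_linear(phi, rhs)


# Hypothetical test signal: impulse response of a two-pole filter, so that
# q(n) = 0 for every n >= 1 (an idealized "closed phase").
a_true = [-1.2, 0.8]
s = []
for n in range(12):
    s_n = 1.0 if n == 0 else 0.0
    for i, a_i in enumerate(a_true, start=1):
        if n - i >= 0:
            s_n -= a_i * s[n - i]
    s.append(s_n)

a_hat = closed_phase_covariance(s, l=3, M=6, p=2)
print(a_hat)
```

With M = p the same routine reduces to Berouti's exactly determined case; with M > p it gives the overdetermined least squares fit.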

Exactly the same form of analysis is done in the so-called
covariance method of linear prediction (58,64); the
only difference is that in the general covariance method, M is
typically chosen to extend over several pitch periods. Thus
it is appropriate to call the Wong et al. method the closed
phase covariance method of linear prediction.

If $\hat{a}_i$, $i = 1, \ldots, p$ are the least squares solution to
equations (3) for the analysis window $W_M(l)$, the residue or
error signal is defined as

$$e(j) = s(j) + \sum_{i=1}^{p} \hat{a}_i s(j-i)$$

In the ideal case, $\hat{a}_i = a_i$, $i = 1, \ldots, p$, i.e., the model
is exact and $e(j) = 0$, $j = l, \ldots, l + M - 1$. When actual
speech data are used, however, this will typically not be
the case and the "goodness of fit" of the model is measured
by the total squared error

$$E_M(l) = \sum_{k=l}^{l+M-1} e^2(k)$$

To ensure some independence from the signal level, it
is more convenient to define a normalized total squared
error, $F_M(l)$, as the measure of the "goodness" of our
modeling:

$$F_M(l) = E_M(l) \Big/ \sum_{i=l}^{l+M-1} s^2(i)$$

$F_M(l)$ is just the total squared error normalized by the
signal energy.
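
As a small numerical illustration of these definitions, the following Python sketch evaluates $F_M(l)$ for a given set of estimated coefficients. The function name `normalized_error` and the sample values are hypothetical, and the energy is summed over the same M samples as the error, which is an assumption about the summation limits.

```python
def normalized_error(s, a_hat, l, M):
    """F_M(l): total squared prediction error over k = l .. l+M-1,
    divided by the signal energy over the same samples (requires l >= p)."""
    p = len(a_hat)
    total_sq_error = 0.0
    energy = 0.0
    for k in range(l, l + M):
        e_k = s[k] + sum(a_hat[i - 1] * s[k - i] for i in range(1, p + 1))
        total_sq_error += e_k * e_k
        energy += s[k] * s[k]
    return total_sq_error / energy


# Illustrative data: s(n) = 2 s(n-1), so a_1 = -2 models it exactly.
s = [1.0, 2.0, 4.0, 8.0, 16.0]
f_exact = normalized_error(s, [-2.0], l=1, M=4)   # exact model
f_poor = normalized_error(s, [-1.0], l=1, M=4)    # mismatched model
print(f_exact, f_poor)
```

The exact model gives zero error, while the mismatched coefficient leaves a quarter of the signal energy in the residue.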

Once the vocal tract filter parameters, $a_i$, $i = 1, \ldots, p$,
have been determined, it is a relatively simple matter to
obtain the glottal volume velocity. First the speech
signal, s(n), is passed through the inverse of the vocal
tract filter,

$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$

to yield the effective source q(n). Then, q(n) can be
integrated to obtain the glottal volume velocity, $u_g(n)$.
Note that the initial condition required for the integrator
is unknown and so $u_g(n)$ is recovered only to an arbitrary
additive constant.
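
The two steps just described, inverse filtering followed by integration, can be sketched as follows. This is an illustrative Python fragment only: the names `inverse_filter` and `integrate`, the one-pole coefficient and the sample values are assumptions, and samples before n = 0 are taken as zero.

```python
def inverse_filter(s, a):
    """q(n) = s(n) + sum_i a[i-1] * s(n-i), i.e. s(n) filtered by A(z)."""
    q = []
    for n in range(len(s)):
        acc = s[n]
        for i in range(1, len(a) + 1):
            if n - i >= 0:
                acc += a[i - 1] * s[n - i]
        q.append(acc)
    return q


def integrate(q, u0=0.0):
    """ug(n) = ug(n-1) + q(n), starting from the unknown constant ug(-1) = u0."""
    ug, running = [], u0
    for x in q:
        running += x
        ug.append(running)
    return ug


# Hypothetical speech samples produced by a one-pole filter (a_1 = -0.5)
# driven by a differentiated triangular pulse.
s = [0.0, 1.0, 1.5, -0.25, -1.125, -0.5625, -0.28125]
a = [-0.5]
q = inverse_filter(s, a)
ug = integrate(q)
ug_offset = integrate(q, u0=3.0)
print(q)
print(ug)
```

Changing `u0` shifts every sample of the recovered waveform by the same amount, which is the arbitrary additive constant noted above.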
Practical  Analysis  Considerations 

Locating  the  glottal  closed  phase.  We  implicitly 
assumed  in  the  previous  section  that  the  duration  and 
location of the closed phase were available a priori. This
is  not  the  case  in  practice,  and  the  closed  phase  has  to  be 
located  either  from  the  speech  signal  itself  or  by  using 
some  other  auxiliary  signal.  Recall  that  the  use  of  the  EGG 
for  this  purpose  was  identified  as  one  of  the  research  goals 
of  this  study. 

Wong, Markel and Gray suggest a method of identifying
this period from the speech signal itself by using the
variation of the normalized total squared error $F_M(l)$ with
the analysis window $W_M(l)$. Briefly, their technique is the
following:

A sequential covariance analysis is performed on the
input data record, i.e., the analysis window is moved by one
point each time and a new analysis done. This results in a
sequence of normalized total errors $F_M(l), F_M(l+1), \ldots$. In
those analysis cases where the window includes data samples
s(i) that obey equation (1) rather than equation (3), i.e.,
$q(i) \ne 0$, the error $F_M(l)$ will be large. On the other hand,
when the analysis window is one of the set
$W_M(L_c + 1), \ldots, W_M(L_o - M)$ shown in Figure 4.6, this error $F_M(l)$ will
be very small. Thus a simple thresholding of the
normalized total error sequence $F_M(l)$ should locate the
closed glottal interval.
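
The sequential analysis and thresholding can be illustrated on synthetic data. The Python sketch below is not the Wong, Markel and Gray program nor the one used in this study: the function names, the first-order test filter, the impulse excitation and the threshold value of 0.01 are all assumptions chosen so that the behaviour is easy to verify.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def window_error(s, l, M, p):
    """Normalized total squared error F_M(l) of the least squares fit."""
    ks = range(l, l + M)
    phi = [[sum(s[k - i] * s[k - j] for k in ks) for j in range(1, p + 1)]
           for i in range(1, p + 1)]
    rhs = [-sum(s[k - i] * s[k] for k in ks) for i in range(1, p + 1)]
    a_hat = solve_linear(phi, rhs)
    err = sum((s[k] + sum(a_hat[i - 1] * s[k - i] for i in range(1, p + 1))) ** 2
              for k in ks)
    return err / sum(s[k] ** 2 for k in ks)


def sequential_errors(s, M, p):
    """Move the window one sample at a time and collect F_M(l)."""
    return {l: window_error(s, l, M, p) for l in range(p, len(s) - M + 1)}


# Synthetic record: one-pole filter (a_1 = -0.5) excited by impulses at
# n = 0 and n = 8, so q(n) = 0 everywhere else.
s = []
for n in range(16):
    q_n = 1.0 if n in (0, 8) else 0.0
    s.append(q_n + (0.5 * s[n - 1] if n > 0 else 0.0))

F = sequential_errors(s, M=4, p=1)
closed_windows = sorted(l for l, f in F.items() if f < 0.01)
print(closed_windows)
```

On this idealized record the windows that avoid the excitation at n = 8 are separated cleanly; real speech does not behave this well, which is consistent with the reservations expressed in the text that follows.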

The  effectiveness  of  this  procedure  has  been  questioned 
by  other  authors  (6,65);  this  author  has  also  found  that  the 
method  is  not  very  useful.  In  this  study,  the  analysis 
window  length  and  the  location  of  the  closed  glottal 
interval  were  determined  by  observing  the  glottal  area 
function or the EGG. Typical values used for M + p range
from 28 to 40.

Analysis filter order, p. A second assumption implicitly
made in the last section was that the filter order, p, which
was needed to accurately model the vocal tract filter, was
known. Again, in practice, this is not the case. Even if p
is known, noise or a non-zero mean in the analysis window
can lead to extraneous, non-formant poles in the vocal tract
filter. Typically, then, one has to use an analysis order
greater than twice the number of expected formants in the
speech wave. Since extraneous zeros in the inverse filter
will distort the derived volume velocity, they need to be
removed before inverse filtering. This is accomplished by
factoring the inverse filter polynomial

$$A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i}$$

deleting all extraneous zeros and recomputing the new
inverse filter.

To aid in deciding the value of p to be used and in
locating the "genuine" inverse filter zeros, the spectrum of
the speech signal is computed and displayed. The spectrum
is computed by two different methods:

i)  a  sixteenth  order  autocorrelation  linear  prediction 
spectrum  computed  from  a  300  point  Hamming  windowed  segment 
of  the  speech  signal  (53,64)  and 

ii)   a  300  point  Hamming  windowed  FFT  spectrum. 

The  two  different  spectra  are  displayed  together. 
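
The recomputation of the inverse filter after zero deletion can be sketched as follows. The Python fragment assumes that the zeros of A(z) have already been obtained from a polynomial root finder (not shown) and that the zeros to delete have been chosen by inspection; the function names `poly_from_zeros` and `delete_zeros` and the numerical values are hypothetical.

```python
def poly_from_zeros(zeros):
    """Coefficients [1, a_1, ..., a_p] of A(z) = prod_r (1 - r z^-1).

    Complex zeros are expected in conjugate pairs so the result is real.
    """
    coeffs = [1.0 + 0j]
    for r in zeros:
        nxt = coeffs + [0j]
        for i in range(1, len(nxt)):
            nxt[i] -= r * coeffs[i - 1]
        coeffs = nxt
    return [c.real for c in coeffs]


def delete_zeros(zeros, unwanted):
    """Return the zeros whose positions are not listed in `unwanted`."""
    return [r for idx, r in enumerate(zeros) if idx not in unwanted]


# Illustrative zeros: one conjugate pair (kept as a "formant") and one
# real-axis zero that is treated here as extraneous.
zeros = [0.6 + 0.6j, 0.6 - 0.6j, 0.5 + 0j]
a_full = poly_from_zeros(zeros)
a_new = poly_from_zeros(delete_zeros(zeros, unwanted={2}))
print(a_full)
print(a_new)
```

The order of the new inverse filter drops by one for every deleted real zero and by two for every deleted conjugate pair.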

Preemphasis. Preemphasis in the speech literature
refers to a simple high pass filtering of the speech
signal. In discrete time systems, this is usually
implemented by the filter (53,64)

$$H(z) = 1 - a z^{-1}, \qquad 0.95 < a < 1.$$
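
In the time domain this filter is the first difference $y(n) = s(n) - a\,s(n-1)$. A minimal Python sketch, with the hypothetical function name `preemphasize`, an illustrative coefficient of 0.95 and the assumption s(-1) = 0, is:

```python
def preemphasize(s, a=0.95):
    """y(n) = s(n) - a * s(n-1), with s(-1) taken as zero."""
    return [s[n] - a * (s[n - 1] if n > 0 else 0.0) for n in range(len(s))]


# A constant (zero frequency) input is almost removed after the first sample,
# which shows the high pass character of the filter.
y = preemphasize([1.0, 1.0, 1.0, 1.0], a=0.95)
print(y)
```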


Preemphasis  is  usually  justified  on  various  grounds 
depending  on  the  subsequent  analysis  to  which  the  speech 
signal  is  subjected.  In  all  cases,  it  "mysteriously"  seems 
to  improve  the  analysis. 

Wong, Markel and Gray suggest preemphasizing the speech
signal in the closed phase covariance method. Their
rationale is that it helps in locating the closed glottal
interval. As stated earlier, their algorithm for locating
the closed phase fails, with or without preemphasis.

Theoretically, the closed phase covariance method does
not require preemphasis. This author has found that the
closed phase vocal tract filters derived with and without
preemphasis are virtually the same.

As a matter of record, most of the closed phase
covariance analysis for this study was done with the
preemphasized speech signal.
Implementation  of  the  Method 

The  software  implementation  of  the  closed  phase  inverse 
filtering  scheme  consists  of  two  main  programs.  The  first 
program  does  a  sequential  covariance  analysis  of  the  speech 
data  and  stores  the  analysis  results  in  a  disk  file. 

The  second  main  program  uses  the  output  of  the  first 
program  to  obtain  a  suitable  inverse  filter.  The  program  is 
interactive  and  uses  numerous  graphic  routines  to  display 
the  relevant  waveforms  at  the  various  stages  in  the 
processing.   The  basic  functions  of  this  program  are 

i)  to  allow  the  user  to  interactively  select  the 
analysis  window  to  be  used, 

ii) to modify the selected inverse filter by pole
deletion,

iii)  to  inverse  filter  the  speech  and  display  the 
resulting  waveforms  and 

iv)  to  return  to  a  previous  processing  step  at  any 
time  if  the  results  are  unsatisfactory. 

A functional block diagram of this program is shown in
Figure 4.7 and Figure 4.8 illustrates some typical steps in
the program.

Figure 4.7 Flowchart of the inverse filtering program
[Flowchart steps, in order: START; read in the sequential
covariance analysis results and plot (see Figure 4.8(a)); get
the user selection of the analysis window; solve the filter
polynomial for the roots and output to the terminal; get the
user input of the poles to be deleted; form the new filter
polynomial and plot the frequency response (see Figure 4.8(b));
inverse filter the speech and plot the differentiated v-v,
integrate it and plot the v-v (see Figures 4.8(c) and 4.8(d));
decision with YES leading to STOP and NO leading to "get the
user selection of the point to ...".]

Figure 4.8(a) Selection of the analysis window using
the differentiated EGG and sequential
total prediction error. [Plot annotation: analysis window selected.]

Figure 4.8(b) Vocal tract filter transfer function after
pole deletion.

Figure 4.8(c) Differentiated v-v obtained after
inverse filtering.

Figure 4.8(d) v-v obtained by removing the mean from the
differentiated v-v and integrating.

Results from the Synchronized Data Base


Subset  of  Data  Base  Inverse  Filtered 


The  analyses  in  the  previous  sections  imply  that  the 
closed  phase  covariance  method  of  inverse  filtering  requires 
the  existence  of  a  closed  glottal  interval  of  at  least  p+1 
samples,  where  p  is  the  order  of  the  vocal  tract  filter 
model.  In  practice,  complete  glottal  closure  is  not 
required  as  long  as  the  minimum  glottal  area  is  small  enough 
that  the  glottal  v-v  during  this  phase  is  nearly  zero.  Of 
course,  the  duration  of  this  interval  of  "small"  glottal 
area  has  to  meet  the  requirements  stated  earlier.  All  the 
high  fundamental  frequency  (340  Hz)  tasks  of  all  the 
subjects  as  well  as  a  few  of  the  low  and  medium  frequency 
tasks  of  two  of  the  subjects  did  not  meet  one  or  both  of  the 
above  requirements.  After  eliminating  these  data  sets,  only 
20  of  the  total  36  tasks  were  left  as  potential  candidates 
for  inverse  filtering.  This  subset  of  the  data  base  was 
analyzed  to  obtain  the  glottal  v-v. 

Three  new  data  sets  of  synchronized  EGG  and  speech  were 
also  included  in  the  inverse  filtering  analysis.  One  of 
these  data  sets  is  from  signals  recorded  on  a  4  channel  FM 
tape  recorder  and  the  other  two  were  directly  digitized  by 
the  computer.  Consequently  both  synchronization  and  tape 
distortion  errors  are  virtually  absent  in  these  cases;  this 
was the primary motivation for including the new data. Note
that no glottal area is available for these tasks.

Temporal Comparisons Between the EGG, the Glottal Area and
the Glottal v-v


Alignment  of  the  synchronized  data.  The  procedure  for 
the  alignment  of  the  high  speed  film  data  and  the  speech 
signal  was  explained  in  Chapter  2.  This  alignment  procedure 
assumes  that  the  speech  and  the  EGG  signals  digitized  from 
the  tapes  are  in  alignment  (except  for  a  fixed  propagation 
delay).  Thus  mismatches  in  the  sampling  rates  and  tape 
speeds  will  cause  errors  in  this  alignment  and  consequently 
in  the  alignment  of  the  speech  with  the  high  speed  films. 
Preliminary  investigations  of  the  alignment  between  the 
glottal  v-v  and  the  glottal  area  showed  that  small  alignment 
errors  did  exist.  Therefore,  the  glottal  v-v  and  the 
glottal  area  waveforms  had  to  be  realigned  based  on  the 
expected  relationship  between  them.  Specifically,  this  was 
accomplished  by  shifting  the  glottal  v-v  signal  until  the 
minimum  value  in  the  differentiated  v-v  coincided  with  the 
instant  of  glottal  closure  as  indicated  by  the  glottal 
area.  The  shift  required  to  accomplish  this  was  between  0 
and  0.5  ms  for  all  the  data  sets.  Note  that  while  this 
procedure  establishes  alignment,  it  cannot  compensate  for 
the  differences  in  the  tape  and  film  speeds.  As  a  result 
the  glottal  area  and  the  glottal  v-v  waveforms  appear  to 
slowly  drift  out  of  synchrony  in  some  of  the  synchronized 
data  plots  presented  below. 
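
The realignment step can be illustrated with a short Python sketch. It is a simplified single-period illustration rather than the procedure actually coded for this study: the function names, the sample values and the closure index are hypothetical, and the 10 kHz sampling rate is the rate quoted later in this chapter for the EGG and the glottal v-v.

```python
def argmin(x):
    """Index of the smallest sample in x."""
    return min(range(len(x)), key=lambda n: x[n])


def alignment_shift(d_vv, closure_index):
    """Samples by which the v-v must be delayed so that the minimum of the
    differentiated v-v coincides with the closure instant from the area."""
    return closure_index - argmin(d_vv)


def shift_signal(x, shift):
    """Delay x by `shift` samples (advance if negative), repeating end samples."""
    n = len(x)
    return [x[min(max(k - shift, 0), n - 1)] for k in range(n)]


fs = 10000.0                      # sampling rate in Hz (assumed)
d_vv = [0.0, 1.0, 2.0, -5.0, 0.0, 0.0]
closure_index = 5                 # closure instant read from the glottal area
shift = alignment_shift(d_vv, closure_index)
aligned = shift_signal(d_vv, shift)
shift_ms = 1000.0 * shift / fs
print(shift, shift_ms, aligned)
```

The same shift would be applied to the v-v waveform itself. A constant shift of this kind cannot correct a slow drift caused by differing tape and film speeds, as noted above.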

Synchronized  data  plots.  Figures  4.9-4.15  are  a 
representative  cross-section  of  the  synchronized  data  plots 
obtained in this study. Figures 4.9-4.12 are from the
synchronized data base of speech, EGG and the high speed
films. Figure 4.13 is from the data obtained from the FM
tape recorder while Figures 4.14 and 4.15 use speech and EGG
signals which were directly digitized into the computer.

The arrangement of the plots in these figures is as
follows:

The  first  graph  is  of  the  EGG  signal  (EGG).  The  second 
is  that  of  the  glottal  v-v  (v-v,  solid  line),  and  the 
glottal  area  (A,  dotted  line)  if  available.  The  third  and 
fourth graphs are of the differentiated EGG (D EGG) and the
differentiated  v-v  (D  v-v)  respectively.  Dotted  vertical 
lines  have  been  drawn  on  these  graphs  to  aid  in  comparing 
the  features  between  the  various  signals.  In  all  the  data 
sets  for  which  a  glottal  area  with  a  closed  glottal  phase 
exists,  these  vertical  lines  have  been  drawn  to  coincide 
with  the  instants  of  glottal  opening  and  glottal  closure. 
In  all  the  remaining  data  sets,  the  lines  have  been  drawn  at 
the  locations  of  the  maximum  and  minimum  in  the 
differentiated  EGG  for  each  glottal  period. 

Figure 4.9 Synchronized data plots, Subj: JMN

Figure 4.10 Synchronized data plots, Subj: DMK; panels 4.10(a), 4.10(b)

Figure 4.11 Synchronized data plots, Subj: GPM; panel 4.11(b)

Figure 4.12 Synchronized data plots, Subj: AKK

Figure 4.13 Synchronized data plot; FM tape recorder used (Subj: RKK, Task 8)

Figure 4.14 Synchronized data plot; directly digitized data

Figure 4.15 Synchronized data plots; directly digitized data

Discussion of the synchronized data plots. The instants
of glottal opening and closure represent significant events
in the glottal cycle. An algorithm for the identification
of these events using the EGG was presented and
experimentally evaluated in the last chapter. We found that
the glottal closure event was predicted very reliably by the
minimum in the differentiated EGG. Identifying the glottal
opening event as the maximum in the differentiated EGG,
while successful, was not consistent. As might be expected,
this  relationship  is  maintained  between  the  differentiated 
v-v  and  the  differentiated  EGG  also.  Thus,  the  minimum  in 
the  differentiated  v-v  agrees  very  well  with  the  minimum  in 
the  differentiated  EGG.  The  start  of  the  v-v  pulse,  which 
is  the  point  where  the  differentiated  v-v  starts  to  increase 
from  its  flat  zero  value,  agrees  with  the  maximum  in  the 
differentiated  EGG.  We  found  in  Chapter  3  that  the  error  in 
locating  the  instant  of  glottal  opening  using  the  EGG  was 
speaker  dependent.  Figures  4.12(a)  and  4.12(b)  show  the 
synchronized  plots  for  subject  AKK,  for  whom  this  error  was 
consistently  large.  As  these  figures  show,  the  errors  are 
carried  over  in  comparing  the  synchronized  EGG  and  glottal 
v-v  also. 

Figure  4.9(b)  corresponds  to  the  task,  subj:  JMN, 
170Hz,  72dB  discussed  earlier  in  Chapter  3.  Here  the 
presence  of  a  mucus  bridge  across  the  vocal  folds  causes  the 
glottal  area  to  increase  very  slowly  until  the  mucus 
separates.  This  leads  to  a  knee  in  the  glottal  area  during 
the  opening  phase.  The  glottal  v-v  also  increases  very 
little  until  after  the  knee.  The  EGG,  which  is  affected  by 
the  conductive  path  provided  by  the  mucus,  does  not  show  a 
rapid  change  until  the  mucus  bridge  begins  to  break.  As  a 
result,  the  maximum  in  the  differentiated  EGG  and  the 
instant  of  increase  in  the  differentiated  v-v  are  better 
correlated than the maximum in the differentiated EGG and
the instant of glottal opening from the glottal area.

Now, theoretically, the glottal v-v and the glottal
area both reach zero simultaneously at glottal closure.
This implies that for sampled data signals, the
differentiated glottal v-v reaches zero one sample after
closure as shown in Figure 4.6. However, in the examples in
Figures 4.9-4.12, the differentiated v-v rises to the
expected zero value only several samples after closure. The
glottal v-v, as a result, reaches zero considerably after
the glottal area does so. The same effect, although to a
much lesser extent, is observed in the FM tape recorder data
as well as the directly digitized data (Figures 4.13-4.15).
This author's opinion is that the observed behaviour
is due to a combination of alignment errors, phase
distortion due to the analog anti-aliasing filters and the
tape recorder distortion.

Using the EGG as an aid in inverse filtering and the
pseudo-closed phase analysis method

Rothenberg  first  proposed  using  the  EGG  as  an  aid  in 
inverse  filtering  in  (37).  This  paper  has  an  example  where 
two  different  settings  of  an  analog  hardware  inverse  filter 
lead  to  plausible  v-v  signals.  However,  only  one  of  these 
signals  correlated  well  with  the  EGG  signal  recorded 
simultaneously.  Thus,  the  inverse  filter  adjusted  to  give 
the v-v which correlated with the EGG was the desired
setting.

Our  results  also  show  that  the  EGG  can  be  used  for  this 
purpose. Once the propagation delay from the glottis to the
microphone is determined, the differentiated EGG can provide
a reasonable estimate of the duration and location of the
glottal  closed  phase.  If  one  is  using  a  hardware  inverse 
filter,  the  settings  of  the  filter  can  be  adjusted  to 
minimize  the  ripple  in  the  output  v-v  during  this  period  of 
time.  In  the  case  of  a  software  scheme  for  inverse 
filtering,  as  used  in  this  study,  the  closed  phase  vocal 
tract  filter  coefficients  can  be  determined  by  an  analysis 
restricted  to  this  period  of  time. 

The  next  step  is  to  use  the  EGG  in  automatic  inverse 
filtering.  An  automated  inverse  filtering  scheme  has  two 
important  applications.  The  first  is  that  an  automated 
inverse  filtering  device  can  be  used  as  a  diagnostic  tool  by 
physicians  (5)  or  as  a  teaching  tool  by  speech  therapists 
and  singing  instructors.  The  second  application  is  in 
inverse  filtering  normal,  running  speech.  There  is  a  real 
need  in  speech  research  to  be  able  to  study  the  dynamic 
behaviour  of  the  voice  source  in  continuous  speech  (3). 
This  requires  the  ability  to  inverse  filter  sentence  length 
utterances,  a  feat  unaccomplished  satisfactorily  to  date. 

If  the  closed  glottal  segments  in  every  voice  period  of 
an  utterance  can  be  isolated,  then,  assuming  these  are  of 
sufficient  duration,  closed  phase  covariance  analysis  over 
these  segments  can  be  used  to  determine  the  glottal  v-v  over 
all  voiced  portions.  Gish  proposed  this  in  (6).  The  key 
here  is  the  determination  of  the  closed  glottal  phases;  the 
results  presented  in  this  study  indicate  that  the  EGG  is  one 
of the best methods to isolate these segments.

It therefore seems reasonable, as a first step in this
direction, to attempt two channel speech analysis using the
speech and the EGG as follows:

i) use the EGG and the algorithms presented in
Chapter 3 to isolate the closed glottal segments;

ii)  do  a  covariance  analysis  over  these  segments  and 
derive  the  closed  phase  vocal  tract  filters  and 

iii)  use  the  closed  phase  vocal  tract  filters  to 
inverse  filter  the  utterance  and  obtain  the  glottal  v-v. 

Since  i)  the  EGG  does  not  provide  a  perfect  indication 
of  the  duration  and  location  of  the  closed  glottis  segments 
and  ii)  complete  glottal  closure  may  not  occur  when  the  EGG 
indicates  glottal  closure,  it  seems  appropriate  to  call  this 
analysis  scheme  the  pseudo-closed  phase  analysis  method. 

This  analysis  technique  is  discussed  in  Chapter  5  where 
we  study  the  use  of  the  EGG  in  speech  analysis. 

Spectral  Comparisons  between  the  Glottal  Waveforms 


The  human  ear,  in  its  early  stages  of  the  processing  of 
sounds, acts as a harmonic analyzer (2). Arising from this
fact,  the  distribution  of  energy  in  the  frequency  domain  of 
speech  and  related  glottal  waveforms  has  been  of 
considerable  interest.  Many  of  the  concepts  and  analysis 
techniques  in  speech  science  are,  in  fact,  formulated  in  the 
frequency  domain. 

The spectrum of the speech signal, $S(\omega)$, is given from
the linear model of Figure 4.1 by

$$S(\omega) = R(\omega) V(\omega) U_g(\omega)$$

where

$R(\omega)$ is the spectrum of the radiation term,
$V(\omega)$ is the spectrum of the vocal tract filter and
$U_g(\omega)$ is the spectrum of the glottal v-v.


The  above  equation  shows  that  the  spectrum  of  the 
glottal  v-v  can  significantly  influence  the  spectrum  of  the 
speech.  Some  of  the  characteristic  differences  between 
different  voice  qualities  (e.g.,  resonance  vs  normal  vs 
breathy)  are  due  to  the  source  spectrum.  It  is,  therefore, 
of  interest  to  examine  the  spectra  of  the  different  glottal 
waveforms.

Computation of the spectra. All the glottal waveforms
are periodic in voiced speech. Most digital spectrum
computation is implemented with an FFT having $n = 2^N$
equispaced data samples (66). Since n samples need not contain an
integer number of glottal periods, the periodic extension of
the data implicitly imposed by the FFT can lead to
discontinuities at the signal ends. Consequently, the
computed spectra can be in error by as much as 3.92 dB
(67). Windowing the data (67) to avoid the discontinuities
at the signal ends introduces its own form of distortion,
and the pitch harmonics in the resulting spectra often
obscure the behaviour we seek to observe.

Most  of  these  problems  can  be  avoided  if  the  spectrum 
is  computed  pitch-synchronously  using  the  DFT  (67).  All  the 
spectra presented in this chapter were computed as follows:

i) individual pitch periods of the glottal waveforms
were isolated;

ii)  the  magnitude  of  the  DFT  coefficients  of  each  of 
these  periods  was  then  computed  and 

iii)  the  spectrum  was  computed  as  the  average  of  the 
magnitude  coefficients  from  a  number  of  adjacent  periods. 
Typically  3-5  periods  were  used. 

The  final  step  (iii)  was  included  to  provide  a  degree 
of  statistical  stability  to  the  spectrum  estimate. 
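
Steps (i) to (iii) can be sketched in Python as follows. The function names, the period marks and the cosine test signal are hypothetical. Averaging by harmonic index and truncating to the shortest period are assumptions of this sketch, since the handling of periods of unequal length is not specified above.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes |X(k)| of the DFT of one period, k = 0 .. len(x)//2."""
    n = len(x)
    return [abs(sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m in range(n)))
            for k in range(n // 2 + 1)]


def pitch_synchronous_spectrum(x, marks):
    """Average the per-period DFT magnitudes over adjacent periods.

    `marks` holds the sample index where each period starts, plus the
    index just past the last period.
    """
    periods = [x[a:b] for a, b in zip(marks[:-1], marks[1:])]
    mags = [dft_magnitudes(p) for p in periods]
    n_harm = min(len(m) for m in mags)
    return [sum(m[k] for m in mags) / len(mags) for k in range(n_harm)]


# Test signal: three periods of a unit cosine with a period of 8 samples.
x = [math.cos(2.0 * math.pi * n / 8.0) for n in range(24)]
spectrum = pitch_synchronous_spectrum(x, marks=[0, 8, 16, 24])
print(spectrum)
```

Because each DFT spans exactly one period, coefficient k falls on the k-th pitch harmonic and no window is needed.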

Comparison of the spectra of the EGG, the glottal area
and the glottal v-v. Figures 4.16(a)-4.16(d) are typical
examples of the spectra of the glottal area, the glottal v-v
and the EGG. Most of the tasks in the data base show a similar
behaviour, and so only these four examples, one for each
subject, are included. The spectra are plotted on a dB
scale, and the spectra of the three glottal waveforms have
been scaled to have the same value (0 dB) at the first
harmonic frequency. Since the glottal area was sampled at
5 kHz while the EGG and the glottal v-v have been sampled at
10 kHz, the spectrum of the area extends only half as far as
the spectra of the other waveforms.

Figure 4.16 Comparison of the spectra of the glottal
area, the EGG and the glottal v-v (panels (a)-(d))

These plots show that, in most cases, the EGG has the
most high frequency energy, followed by the glottal area and
finally the glottal v-v. It is a surprising observation
that the glottal area has more high frequency content than
the v-v since most researchers claim otherwise (2,50). It
is, however, necessary to consider the following:

1) the sampling of the glottal area at 5 kHz may not be
adequate. If undersampling occurred, this
would lead to aliasing in the frequency domain, which could
explain the larger high frequency content in the glottal
area. If this is the case, then the observation is clearly
an artifact of the data.

2)  the  claim  that  the  glottal  v-v  has  more  high 
frequency  content  than  the  area  is  based  on  the  assumptions: 

i) the slope at closure of the glottal v-v is usually
larger than that of the glottal area, and

ii)  the  slope  of  both  waveforms  returns  abruptly  to 
zero  after  closure. 

The  fact  that  the  glottal  v-v  slope  in  our  data  does 
not  return  to  zero  immediately  after  closure  was  discussed 
in  a  previous  subsection;  this  same  effect  could  also 
account  for  the  lower  high  frequency  content  than 
anticipated  in  the  glottal  v-v. 

Variations of the spectra of EGG and the differentiated
v-v with intensity. An increase in the intensity or
loudness of an utterance is brought about by changes in the
glottal v-v. The corresponding changes required in the
vocal fold vibration are not well understood. However, as
stated by Fant,

"There appear to be two modes available to produce an
intensity increase. One is a rise in the overall scale
factor of glottal flow pulses which is a main consequence of
increased subglottal pressure. The other is an adduction of
the vocal cords physiologically induced by a medial
compression while maintaining or even reducing the amount of
air contained in a single pulse." (3, page 30)

It is, of course, possible that in a given case the
intensity increase is brought about by a combination of
these two modes. While both modes result in an intensity
increase, the effect on the spectrum of the output speech
wave is different in the two cases. The first mode merely
shifts the spectrum to a higher level while the second also
redistributes the energy in the spectrum. Further, the
first mode should not significantly affect vocal fold
vibration, while the second mode will cause the vocal folds
to close more rapidly. Consequently, the second mode of
intensity increase should have a correlate in the EGG: an
increase in the slope of the EGG at closure. An increase in
the slope of the EGG with increasing intensity while the
fundamental frequency was constant has been observed by Baer
et al. (38). Some of the data used in this study also
support this view; an example is shown in Figure 4.17.

Figure 4.17 Increasing slope at closure of EGG with increasing
intensity at a fixed fundamental frequency

The tasks in the synchronized data base allow us to
systematically test for the presence of the second mode of
intensity increase. The following analysis was conducted
for this purpose:

i) the spectrum of the differentiated v-v was
computed for the three intensities (at a fixed pitch
frequency) at which the subject phonated. These spectra
were then scaled to have the same value (0 dB) at the first
pitch harmonic frequency;

ii) the same procedure as in (i) was done for the EGG
signals in the corresponding tasks; and

iii) the ratio of the energy above 1000 Hz to the energy
below 1000 Hz was computed for each of the spectra. This
ratio has been suggested as a suitable measure of the
"richness" of the voice in (68).

Note  that  the  first  mode  of  intensity  increase  will  not 
be  revealed  by  this  analysis  since  the  amplitude  of  the 
speech  signals  across  different  tasks  cannot  be  compared  due 
to  the  unknown  amplification  involved  in  the  recording  and 
playback  of  the  signals. 
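
The ratio of step (iii) can be sketched in a few lines. The following is only an illustration, not the program used in this study: the function name `energy_ratio`, the use of a plain DFT, and the exclusion of the DC bin are assumptions made for the example. Since the ratio is unchanged by a constant scaling of the spectrum, the 0 dB normalization of steps (i) and (ii) does not affect it, which is also why the unknown recording gain does not matter here.

```python
import cmath
import math


def energy_ratio(signal, fs, split_hz=1000.0):
    """Ratio of spectral energy above split_hz to the energy at or below it.

    signal: list of samples, fs: sampling rate in Hz (both hypothetical inputs).
    """
    n = len(signal)
    low = 0.0
    high = 0.0
    # Naive DFT over the positive-frequency bins; the DC bin (k = 0) is skipped.
    for k in range(1, n // 2 + 1):
        freq = k * fs / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * m / n)
                    for m, x in enumerate(signal))
        power = abs(coeff) ** 2
        if freq > split_hz:
            high += power
        else:
            low += power
    # Note: fails with ZeroDivisionError if there is no energy below split_hz.
    return high / low
```

For a signal made of a unit-amplitude 125 Hz component and a 2000 Hz component of amplitude 0.1, the function returns a ratio close to 0.01, the same order of magnitude as the entries of Table 4.1.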

The computed spectra for three of the subjects are
shown in Figures 4.18(a)-4.18(f). The corresponding ratio
values are in Table 4.1.

An examination of the figures shows that in 3 of the 5
samples (Figures 4.18(a), 4.18(b), and 4.18(e)) the EGG
spectra consistently show more high frequency energy with
increasing intensity. The spectra of the differentiated
v-v, on the other hand, while exhibiting a tendency toward
more high frequency energy with increasing intensity, are
not as consistent. As an example, in Figure 4.18(a), the
spectrum of the differentiated v-v at an intensity of 64 dB
is below that of the differentiated v-v at 60 dB.

In  the  remaining  two  examples,  Figures  4.18(c)  and 
4.18(d),  the  spectra  of  the  EGG  do  not  change  significantly 
with  intensity.  The  differentiated  v-v  appears  to  show 
increased  energy  in  the  high  frequencies  at  the  higher 
intensities,  but  again  the  behaviour  is  not  consistent. 


Figure  4.18  Comparison  of  the  changes  in  the  spectra  of  the  EGG 
and  the  differentiated  glottal  v-v  with  intensity 
Figure 4.18(a)
TABLE 4.1  RATIO OF ENERGY ABOVE 1000 Hz TO ENERGY BELOW 1000 Hz

Subject, Fo     dB    D v-v        EGG

JMN, 125 Hz     56    0.1849E-2    0.5630E-2
                60    0.7081E-2    0.5929E-2
                64    0.5629E-2    0.6505E-2

JMN, 125 Hz     68    0.3613E-2    0.6400E-2
                72    0.3134E-2    0.8532E-2
                75    0.9329E-2    0.1770E-1

DMK, 125 Hz     68    0.2709E-2    0.7991E-3
                74    0.6186E-2    0.1087E-2
                77    0.1801E-1    0.1058E-2

DMK, 125 Hz     64    0.2037E-2    0.1516E-2
                70    0.7285E-1    0.2681E-2
                74    0.5854E-2    0.2428E-2

(D v-v: differentiated v-v)

A final example is shown in Figure 4.18(f). The
subject phonated the vowel /a/ with an increasing
intensity. The EGG and speech signals for this case were
directly digitized using the computer. The plot of the
synchronized glottal waveforms appears in Figure 4.15. It
is seen that the amplitude of the v-v pulse increases with
increasing intensity. The voice fundamental frequency also
increases with increasing intensity. Now it is known that
an increase in the subglottal pressure is usually
accompanied by an increase in the pitch frequency (60).
Thus, all the evidence points to the mechanism of intensity
increase being attributable to the first mode. An
examination of Figure 4.18(f) shows that the EGG spectra do
not change significantly with increasing intensity. The
spectra of the differentiated v-v show an increase in the
low frequency energy with increasing intensity, while the
high frequency behaviour does not change.

Note


Recently, Flanagan, Ishizaka and Shipley (69) used the
Ishizaka-Flanagan model in an analysis-by-synthesis
approach where the parameters for driving the model are
estimated by synthesizing the speech output for different
parameter values, and finally selecting the set that
leads to the closest match (in some predetermined sense)
between the synthesized speech and the speech being
analyzed. Once this is done, the derived parameters can
be used to drive the model and thus obtain the glottal
v-v. This may be considered as a method of inverse
filtering. However, the computational requirements are
enormous; for example, Flanagan estimates that the ratio
of computing time to real time in this method is 12000!


CHAPTER  5 

TWO  CHANNEL  SPEECH  ANALYSIS  USING 
THE  EGG  AND  THE  ACOUSTIC  SPEECH  SIGNAL 


Introduction 


The  EGG  has  been  compared  with  the  ultra-high  speed 
films  and  the  glottal  volume  velocity  in  the  last  two 
chapters.  The  primary  motivation  was  to  develop  an 
understanding  of  the  EGG  and  its  principal  features.  The 
experimental  results  presented  demonstrate  that  the  EGG  is 
indicative  of  the  amount  of  lateral  glottal  contact. 
Furthermore,  features  in  the  EGG  allow  a  reliable  estimation 
of  the  instants  of  glottal  closure  and  opening.  In  this 
chapter,  we  apply  the  EGG  to  some  of  the  problems  in  speech 
analysis. 

Most  speech  analysis  systems  are  based  on  the  linear 
source-filter  model  of  Figure  5.1  (53).  The  model  allows 
two classes of input excitation signals: quasi-periodic
pulses  of  glottal  v-v  in  the  case  of  voiced  sounds,  and 
random  noise  for  the  unvoiced  sounds  of  speech.  The 
physiological  origin  of  the  excitation  for  voiced  speech  has 
been  discussed  in  Chapter  4.  The  random  noise  excitation 
models  the  acoustic  sound  source  created  by  the  turbulence 
of  air  at  a  constriction  of  the  vocal  tract  (53). 

Figure 5.1 Linear Source - Filter model for speech

The different sounds in normal speech are produced by
changing the position of the supraglottal articulators,
which changes the transmission properties of the vocal tract
(VT) filter. This phenomenon is modeled by allowing the VT
filter in Figure 5.1 to be time-varying. However, it is
reasonable to assume that, for most speech sounds, the
properties of the VT filter remain fixed for periods of
10ms-20ms (53). Thus, the VT filter is slowly time-varying,
or quasi-stationary. The reader will have recognized the
model of Figure 5.1 as a minor generalization of the voiced
speech model used in Chapter 4.

The  purpose  of  speech  analysis  is  to  estimate  the 
parameters  of  a  speech  production  model  from  a  speech 
signal.  Thus,  in  the  context  of  the  model  of  Figure  5.1, 
the  basic  problems  of  speech  analysis  are  (70): 

i)  classification  of  the  speech  signal  into  voiced 
and  unvoiced  segments; 

ii)  determination  of  the  fundamental  frequency  (or  Fo) 
in  the  voiced  segments; 

iii)  estimation  of  the  glottal  v-v  in  the  voiced 
segments;  and 

iv)   estimation  of  the  VT  filter. 

Traditionally,  speech  analysis  algorithms  have  been 
implemented  using  the  speech  signal  alone.  The  purpose  of 
this  chapter  is  to  demonstrate  that  by  adding  the  EGG  as  a 
second  channel  of  information,  one  can  significantly  improve 
upon  the  analysis  of  each  of  the  problems  stated  above. 

The arrangement of this chapter is as follows:

The problems of voiced/unvoiced classification and
pitch period estimation are usually dealt with together.
The use of the EGG in this context is described in the next
section.

The problems (iii) and (iv) above also go together in
the sense that an accurate estimation of either the glottal
v-v or the VT filter allows a determination of the other
quantity to within the limits of the assumed model (63).
The use of the EGG for these purposes is described in the
third section.

Finally, we conclude the chapter with a discussion of
the implications of our results and possible extensions of
the analysis techniques presented here.

Voiced/Unvoiced  Classification  and  Fo  Estimation 


Pitch period estimation (or equivalently, Fo
estimation) is one of the most important problems in speech
analysis  (53).  Since  the  Fo  is  a  measure  of  the  periodicity 
of  the  speech  signal,  most  pitch  detection  schemes 
automatically  designate  a  speech  segment  as  unvoiced  if  the 
segment  is  aperiodic.  Thus,  voiced/unvoiced  classification 
is  automatically  built  into  the  Fo  estimation  algorithm. 




Speech  Based  Autocorrelation  Method 


A  large  number  of  algorithms  for  Fo  estimation  from  the 
speech  signal  have  been  published  (54);  this  is  still  an 
active  area  of  research,  and  many  new  algorithms  will 
continue  to  be  published.  One  of  the  most  successful 
algorithms  is  based  on  the  short  time  autocorrelation  of  the 
speech  signal. 

Specifically, let $x(n)$ be a discrete time, periodic
signal with a period of $P$ samples, i.e., $x(n+kP) = x(n)$,
$k = 0, 1, \ldots$ Then the autocorrelation function,
$\phi(k)$, of $x(n)$ is defined as (53)

$$\phi(k) = \lim_{N \to \infty} \frac{1}{2N+1} \sum_{m=-N}^{N} x(m)\, x(m+k)$$

$\phi(k)$ has the properties that

i) $\phi(0)$ is a maximum, i.e., $|\phi(k)| \le \phi(0)$ for all $k$,

ii) $\phi(k)$ is an even function, i.e., $\phi(-k) = \phi(k)$, and

iii) $\phi(k + nP) = \phi(k)$ for $n = 0, 1, \ldots$.

The properties above imply that the autocorrelation
function is a maximum at $k = 0, \pm P, \pm 2P$, etc. Hence, a
simple algorithm to locate the peaks in $\phi(k)$ can be used
to estimate the pitch period, $P$.

The speech signal is, however, only quasi-periodic since
the periodicity of the signal changes slowly with time. In
such a case, the speech signal is multiplied by a finite
window, $w(n)$, and use is made of the short-time
autocorrelation function,

$$R_n(k) = \sum_{m=-\infty}^{\infty} [x(n+m)\, w(m)]\, [x(n+m+k)\, w(m+k)]$$

If $w(n) = 0$ outside the range $n = 0, 1, \ldots, N-1$, then the
short time autocorrelation function is

$$R_n(k) = \sum_{m=0}^{N-1-k} [x(n+m)\, w(m)]\, [x(n+m+k)\, w(m+k)]$$

If the speech signal is periodic and $N$ is sufficiently
long, then $R_n(k)$ will also have a peak at multiples of the
pitch period, $P$. Thus, these peaks can be located, and the
pitch period estimated. Further, by property (i) above,
$R_n(k)$ can be normalized so that $R_n(0)$ is equal to 1. If
the speech signal is unvoiced (and hence aperiodic), $R_n(k)$
will not have any strong peaks and the speech segment can be
labelled unvoiced.

Note  that  the  short  time  autocorrelation  function  as 
defined  computes  the  autocorrelation  of  N  samples  of  the 
windowed  speech  signals  beginning  with  the  sample  at  time 
n.  Thus,  by  varying  n,  we  obtain  the  variation  with  time  of 
Fo,  or  the  pitch  contour  of  the  utterance. 
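
A minimal sketch of the short time autocorrelation approach described above is given here for illustration. It is not the program used in this study: it assumes a rectangular window, and the lag search range (25 to 140 samples), the voicing threshold of 0.3, and all function names are hypothetical choices made for the example. The helper `synthetic_voiced` only builds a test signal.

```python
import math


def short_time_autocorr(frame, max_lag):
    """R(k) = sum_{m=0}^{N-1-k} x(m) x(m+k) for k = 0..max_lag (rectangular window)."""
    n = len(frame)
    return [sum(frame[m] * frame[m + k] for m in range(n - k))
            for k in range(max_lag + 1)]


def estimate_pitch_period(frame, min_lag, max_lag, voicing_threshold=0.3):
    """Return the pitch period in samples, or 0 if the frame is labelled unvoiced."""
    r = short_time_autocorr(frame, max_lag)
    if r[0] <= 0.0:
        return 0  # silent frame: nothing to normalize
    # Normalize so that R(0) = 1 and pick the largest value in the lag range.
    best_lag = max(range(min_lag, max_lag + 1), key=lambda k: r[k] / r[0])
    # No strong peak means the frame is treated as aperiodic (unvoiced).
    return best_lag if r[best_lag] / r[0] >= voicing_threshold else 0


def pitch_contour(x, frame_len=300, hop=100, min_lag=25, max_lag=140):
    """Pitch period per frame; a 300-sample frame with hop 100 overlaps by 200."""
    return [estimate_pitch_period(x[s:s + frame_len], min_lag, max_lag)
            for s in range(0, len(x) - frame_len + 1, hop)]


def synthetic_voiced(n_samples, period=80):
    """Hypothetical test signal: a damped oscillation restarted every `period` samples."""
    return [math.exp(-0.05 * (m % period)) * math.sin(2 * math.pi * (m % period) / 10)
            for m in range(n_samples)]
```

Calling `pitch_contour(synthetic_voiced(600))` yields one estimate per frame, each equal to the 80-sample period of the test signal. Because the rectangular window makes $R_n(k)$ taper with increasing lag, a practical implementation would need the additional safeguards discussed in (53,54).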

Practical  Implementation  of  the  Autocorrelation  Pitch 
Estimation Algorithm. There are a number of important issues
that  need  to  be  considered  in  a  practical  implementation  of 
the  autocorrelation  method  of  pitch  estimation.  Since  the 
exposition  of  speech  based  pitch  estimation  algorithms  per 
se  is  not  the  purpose  of  this  study,  we  do  not  address  these 
issues  further,  but  refer  the  reader  to  the  material  in 
(53,54). Mr. J.J. Yea at the Center for Mind-Machine
Interaction  research  has  implemented  a  computer  program  for 
pitch  estimation  and  voiced/unvoiced  classification  that 
incorporates  most  of  these  considerations.  All  pitch 
contours  estimated  from  the  speech  signal  and  reported  in 
this  thesis  were  obtained  using  this  computer  program.  The 
program  uses  an  analysis  frame  size  (N,  in  equation  (1))  of 
300  samples,  with  an  overlap  between  successive  frames  of 
200  samples. 

Discussion. The short time autocorrelation method of
pitch  estimation  basically  uses  the  similarity  of  the  speech 
signal  between  adjacent  pitch  periods.  Thus,  the  analysis 
window  size,  N  in  equation  (1),  needs  to  be  large  enough  to 
encompass  at  least  two  pitch  periods  of  the  speech  signal  at 
the  lowest  Fo  expected.  On  the  other  hand,  a  large  value  of 
N  means  that  for  small  pitch  periods  (i.e.,  high  Fo),  many 
pitch  periods  of  the  speech  waveform  will  be  contained  in 
the  analysis  window.  Consequently,  if  the  periodicity 
changes  over  this  segment,  the  estimated  pitch  period  value 
will  be  a  "smeared"  or  average  value  for  this  segment. 

While  the  discussion  above  has  been  restricted  to  the 
autocorrelation  method,  this  "smearing"  effect  is  inherent 
in  all  Fo  estimation  schemes  using  the  speech  signal. 

An EGG Based Method

Ideally, the classification of the speech signal into
voiced and unvoiced regions is extremely simple if the EGG
signal, simultaneously obtained and synchronized with the
speech data, is available (see note 1). Since the EGG is
ideally  zero  during  unvoiced  intervals,  and  periodic  and 
nonzero  during  voiced  intervals,  a  simple  thresholding  of 
the  EGG  amplitude  should  suffice  to  separate  these 
segments.  While  this  approach  performs  satisfactorily, 
attention  to  a  few  practical  details  improves  the  results 
significantly.   These  considerations  are  discussed  shortly. 

The  estimation  of  the  voice  Fo  is  also  simple  using  the 
EGG.  The  results  of  our  previous  chapters  have  shown  that 
the  EGG  is  a  periodic  signal  with  exactly  two  zero  crossings 
per  period.  The  EGG  period  is  also  directly  a  result  of  the 
periodicity  of  vocal  fold  vibration.  Thus,  as  illustrated 
in  Figure  5.2,  the  pitch  period  can  be  estimated  as  the  time 
duration between two successive "invariant" features in the
EGG.  For  example,  either  the  positive  or  negative  going 
zero  crossings  can  be  used  for  this  purpose.  Another  choice 
is  the  sharp  negative  spike  corresponding  to  glottal 
closure.  We  have  already  used  closure  locations  as  the 
feature  to  estimate  the  Fo  in  Chapter  3.  Comparison  with 
the  period  values  measured  from  the  glottal  area  showed  that 
the  EGG  estimated  pitch  periods  were  in  error  by  less  than 
0.5%  in  most  cases. 

At  this  point,  one  advantage  of  the  EGG  based  pitch 
detection  scheme  is  obvious;  namely,  the  method  computes  the 
pitch  period  on  a  period  by  period  basis.  Consequently,  the 
pitch  smearing  effects  inherent  in  speech  based  methods  are 
entirely  avoided. 
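
As an illustration of period by period measurement, the sketch below uses the positive to negative zero crossings of the EGG as the "invariant" feature. It is a simplified example rather than the implementation used in this study: it assumes the EGG has already had its low frequency drift removed, it works at whole-sample resolution without interpolation, and the function names are hypothetical.

```python
def negative_going_zero_crossings(egg):
    """Indices n where the signal passes from positive (n-1) to non-positive (n)."""
    return [n for n in range(1, len(egg)) if egg[n - 1] > 0 >= egg[n]]


def pitch_periods(egg, fs):
    """One pitch period estimate (in seconds) per glottal cycle.

    egg: list of EGG samples, fs: sampling rate in Hz (hypothetical inputs).
    """
    zc = negative_going_zero_crossings(egg)
    # Each pair of successive crossings delimits exactly one period.
    return [(b - a) / fs for a, b in zip(zc, zc[1:])]
```

Each returned value corresponds to a single cycle, so no averaging over an analysis window takes place. The closure spikes of the differentiated EGG could be used in place of the zero crossings in the same way.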
P1, P2, and P3 are different choices for measuring the pitch period
from the EGG


Figure 5.2  Pitch measurement from the EGG

The locations of the glottal closing instants
determined  from  the  EGG  allow  the  isolation  of  individual 
periods  of  the  speech  waveform  from  closure  to  closure. 
Within  each  such  period,  the  location  of  the  opening  instant 
is  used  to  further  divide  the  speech  signal  into  closed  and 
open  glottis  segments.  Such  "fine"  segmentation  of  the 
speech  waveform  is  essential  for  some  of  the  speech  analysis 
algorithms  discussed  later.  It  is,  therefore,  appropriate 
to  include  the  determination  of  the  closing  and  opening 
instants  as  part  of  an  algorithm  that  does  voiced/unvoiced 
classification  and  Fo  estimation  using  the  EGG. 

Practical Implementation. A speaker continuously adjusts
the  position  of  his  larynx  while  he  says  a  phrase  or 
sentence  containing  different  speech  sounds.  Although  these 
movements  of  the  larynx  are  small,  they  are  sufficient  to 
alter  the  impedance  seen  by  the  electroglottograph 
electrodes.  Consequently,  a  slow  variation  in  the  DC  level 
is  superposed  on  the  normal  vibratory  signal  in  the  EGG. 
This  low  frequency  variation  needs  to  be  filtered  from  the 
EGG  signal  before  any  of  the  analysis  techniques  described 
above  can  be  applied.  A  351  point  FIR  linear  phase  band 
pass  filter  (20)  with  a  lower  cut  off  frequency  of  80  Hz  is 
used  for  this  purpose.   Appendix  C  describes  the  filter. 

The  EGG  varies  in  amplitude  across  speakers.  The 
amplitude  also  varies  across  an  utterance  by  a  speaker. 
Consequently,  a  fixed  thresholding  of  the  amplitude  is 
generally  inadequate  to  separate  the  voiced  and  unvoiced 
segments. Recall now that the EGG has exactly two zero
crossings per pitch period during the voiced segments. In
the unvoiced segments, the output of the electroglottograph
is  the  high  frequency  noise  generated  by  the  internal 
electronics  of  the  device;  this  noise  generally  has  a  high 
zero  crossing  rate.  Thus,  it  seems  appropriate  to 
accomplish  the  voiced/unvoiced  classification  using  a 
combination  of  the  EGG  amplitude  and  zero  crossing  rate. 

A computer program incorporating all the
considerations outlined above has been implemented. The
program is described by the flowchart shown in Figure 5.3.
The EGG signal is divided into frames, each frame consisting
of 300 points. Successive frames overlap by 200 points. The
maximum EGG amplitude and the EGG zero crossing rate in the
frame are used to classify the frame as voiced or
unvoiced. In the voiced frames, the average pitch period is
computed as the average of the time interval between
successive positive to negative zero crossings. The EGG is
then differentiated using the filter $H(z) = 1 - z^{-1}$. The
opening and closing instants in the frame are located using
the Diff EGG and the algorithm described in Chapter 3. All
output from the program is stored in disk files for use by
subsequent speech analysis programs.

Results

We compare the results for Fo estimation and
voiced/unvoiced classification for the speech based
autocorrelation method and the EGG based method in Figures
5.4, 5.5, and 5.6. In these figures, the graph is a plot of
the pitch contour from the speech (dotted line) and the EGG
(solid line). Unvoiced regions have been assigned a pitch
value of 0.

1. Divide the EGG record into frames, each 300 points long with a 200
point overlap between frames. Let the number of frames = NFRM.

2. For each frame i = 1, ..., NFRM,

Find the maximum in the EGG record in the frame. Let this be EMAX(i).

3. Find the maximum and minimum values of EMAX(i), i = 1, ..., NFRM.
Let these be BIG and SMALL respectively.

4. Set the amplitude threshold, ATHRES, as

ATHRES = 0.2*(BIG - SMALL) + SMALL.

5. For each frame i = 1, ..., NFRM,

i)   Remove the mean from the EGG record in the frame.

ii)  Compute the number of zero crossings in the frame, ZCR(i).

iii) Set BZCR and SZCR as the maximum and minimum, respectively,
of the number of zero crossings expected in a frame. This
depends on the voice type (e.g., male, female etc.).

6. For each frame i = 1, ..., NFRM,

i)   If ZCR(i) is not between SZCR and BZCR, assign the frame as
unvoiced.

ii)  If EMAX(i) > ATHRES and ZCR(i) is between SZCR and BZCR, assign
the frame as voiced.

iii) If EMAX(i) < ATHRES and ZCR(i) is between SZCR and BZCR, assign
the frame as voiced if frames i-1 and i+1 are voiced; else
assign it as unvoiced.

7. For the frames assigned as voiced,

i)   Determine the pitch period for the frame as the average of
the interval between successive positive to negative zero
crossings in the frame.

ii)  Differentiate the EGG and use the algorithm EGG-Closed-Open
of Chapter 3 to locate the glottal opening and closing
instants in the frame.

8. Store the voiced/unvoiced decision, pitch period and opening and
closing instants for the frames in a disk file.


Figure 5.3  Algorithm for voiced/unvoiced decision and pitch period
estimation using the EGG.
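
Steps 1 to 6 of the algorithm of Figure 5.3 can be sketched as follows. This is an illustrative reading of the figure and not the original program. Two points are assumptions: a frame whose maximum equals ATHRES is treated as a low amplitude frame, and the neighbours consulted in step 6(iii) are judged by the decisions of step 6(ii). The function names and the SZCR and BZCR values passed by the caller are hypothetical.

```python
def frame_signal(x, frame_len=300, hop=100):
    """Step 1: 300-point frames; a hop of 100 gives a 200-point overlap."""
    return [x[s:s + frame_len] for s in range(0, len(x) - frame_len + 1, hop)]


def count_zero_crossings(frame):
    """Steps 5(i)-(ii): remove the frame mean, then count sign changes."""
    mean = sum(frame) / len(frame)
    centered = [v - mean for v in frame]
    return sum(1 for a, b in zip(centered, centered[1:]) if (a > 0) != (b > 0))


def classify_frames(egg, szcr, bzcr, frame_len=300, hop=100):
    """Return one voiced (True) / unvoiced (False) flag per frame."""
    frames = frame_signal(egg, frame_len, hop)
    emax = [max(f) for f in frames]                    # step 2
    big, small = max(emax), min(emax)                  # step 3
    athres = 0.2 * (big - small) + small               # step 4
    zcr = [count_zero_crossings(f) for f in frames]    # step 5
    in_range = [szcr <= z <= bzcr for z in zcr]
    # Steps 6(i)-(ii): ZCR out of range is unvoiced; high amplitude in range is voiced.
    voiced = [ok and e > athres for ok, e in zip(in_range, emax)]
    # Step 6(iii): low amplitude frames with a plausible ZCR follow their neighbours.
    result = list(voiced)
    for i, (ok, e) in enumerate(zip(in_range, emax)):
        if ok and e <= athres:
            result[i] = (0 < i < len(frames) - 1) and voiced[i - 1] and voiced[i + 1]
    return result
```

For example, with a frame of 300 points and two zero crossings per pitch period, a caller expecting between 2 and 10 pitch periods per frame might pass `szcr=4` and `bzcr=20`; these bounds are illustrative only and, as noted in the figure, depend on the voice type.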

Figure  5.4  is  the  pitch  contour  for  the  utterance  "Ten 
pins  were  set  in  order"  spoken  by  a  male  speaker,  DMK. 
Figures  5.5  and  5.6  are  the  pitch  contours  for  the 
utterances  "We  were  away"  and  "Should  we  chase," 
respectively.  These  two  sentences  were  spoken  by  the  same 
male  speaker,  RAVI. 

An  examination  of  these  figures  shows  that  pitch 
contours  estimated  from  the  speech  signal  and  the  EGG  are 
virtually  identical  during  the  segments  both  methods 
assigned  as  voiced.  However,  there  are  a  number  of  segments 
that  have  been  assigned  as  voiced  by  the  EGG  based  algorithm 
but  as  unvoiced  by  the  speech  based  method. 

Figure  5.7  shows  the  EGG  and  speech  waveforms  for  the 
utterance  of  Figure  5.4  during  the  time  interval  from  0.75s 
to 0.83s. This corresponds to the /s/ in "pins," where the
classifications by the EGG based and speech based methods
disagree. The nature of the EGG clearly implies that
vocal  fold  vibration  is  taking  place.  The  speech  signal 
also  shows  periodicity,  albeit  with  a  reduced  amplitude,  and 
more  high  frequency  content.  This  is  an  example  of  a  mixed 
excitation  region,  where  both  voiced  and  unvoiced  excitation 
modes are present. Most speech analysis systems do not
model mixed excitation because of the difficulty of
identifying these regions from the speech signal alone.
Typically, identification of mixed excitation requires
sophisticated pattern recognition techniques (71). Our
example illustrates one advantage of two channel processing:
not only are mixed excitation regions identified, but a good
estimate of the voice pitch in these intervals is also
available.

Figure 5.4  Pitch contour for the utterance "Ten pins were
set in order," Subj: DMK. EGG method, solid
line; speech method, dotted line. (Axes: pitch
versus time in seconds.)

All  the  remaining  cases  of  discrepancy  in  the 
voiced/unvoiced  classification  occur  at  the  transition  from 
a  voiced  region  to  an  unvoiced  region.  Figure  5.8  is  a  plot 
of  the  EGG  and  speech  waveforms  for  the  utterance  of  Figure 
5.5  during  the  time  interval  from  1.06s  to  1.16s.  The  EGG 
shows  vocal  fold  vibration  slowly  coming  to  a  stop.  The 
speech  signal  during  this  time,  however,  is  very  much 
reduced  in  amplitude  -  in  fact,  the  signal  is  below  the 
noise  level  for  our  recording  conditions.  This  finding 
illustrates  two  facts: 

1)  Vocal  fold  vibration  does  not  stop  abruptly  at  the 
end of voicing, but slowly decays as the vocal folds come to
a  rest  position.  This  is  not  unreasonable,  since  the 
aerodynamic  forces  sustaining  vibration  change  their  state 
slowly  and  not  abruptly. 

2)  It  is  possible  for  vocal  fold  vibration  to  continue 
without  the  generation  of  any  significant  acoustic  energy. 
The  same  phenomenon  has  also  been  reported  in  (72). 

Discussion. The EGG provides an extremely accurate and
reliable method for computing the pitch contour of an
utterance. The "pitch smearing" effect inherent in speech
signal based methods is avoided, and pitch values are
available  on  a  period  by  period  basis.  This  has  important 
applications  in  several  problems  where  accurate  pitch 
contour  estimates  are  desired  (73,74).  Thus,  the  EGG  based 
methods  provide  a  simple  alternative  to  the  computationally 
expensive  semi-automatic  pitch  detector  scheme  described  by 
McGonegal  et  al.  (73). 

Finally,  it  is  appropriate  to  compare  the  computational 
requirements  of  the  speech  and  EGG  based  methods.  The 
autocorrelation  speech  method  requires  extensive  computation 
for  the  calculation  of  the  short  time  autocorrelation 
function.  The  EGG  based  method,  on  the  other  hand,  utilizes 
simple  operations  such  as  thresholding  and  measurement  of 
the  zero  crossing  rate.  In  fact,  while  for  this  study,  both 
methods  were  simulated  in  software,  the  EGG  based  technique 
can be implemented using simple hardware and can perform in
"almost" real time. Abberton and Fourcin (75) have
constructed  such  a  hardware  system  for  the  display  of  the 
pitch  contour  using  the  EGG. 


Estimation  of  the  Vocal  Tract  Filter  and 
the  Glottal  Volume  Velocity 


We have already discussed the fact that the two
problems of vocal tract filter estimation and glottal volume
velocity estimation are equivalent. As explained in Chapter
4, our approach to automatically obtaining the glottal
volume velocity is to first obtain an accurate model for the
vocal tract. Inverse filtering the speech will then yield
the glottal v-v. Therefore, we first concentrate on the
estimation of the vocal tract filter.

The  speech  signal  and  the  vocal  tract  filter  can  be 
modeled  in  many  different  ways.  The  expositions  in  (53,54) 
contain  good  descriptions  of  these  approaches.  The 
separation  of  the  periodicity  of  the  glottal  volume  velocity 
from  the  filtering  imposed  on  it  by  the  VT  filter  is  a 
recurring  problem  in  most  of  these  analysis  techniques.  The 
EGG  should,  therefore,  be  of  use  in  alleviating  the 
periodicity  problem  in  all  these  cases.  We  shall 
concentrate  on  one  particular  analysis  technique:  linear 
prediction analysis.

Linear Prediction Analysis of Speech

Linear  prediction  analysis  or  linear  prediction  coding 
(LPC)  of  speech  is  undoubtedly  one  of  the  most  successful 
analysis  techniques  applied  to  the  speech  signal.  The 
closed phase covariance analysis used in Chapter 4 is a
particular  example  of  LPC. 

LPC  is  usually  introduced  using  the  model  for  speech 
production  shown  in  Figure  5.9.  The  reader  will  recognize 
this  as  yet  another  variation  of  the  ubiquitous  source- 
filter  model.  Note  that  the  VT  filter  is  assumed  to  be  all 
pole,  and  that  the  excitation  source  for  voiced  speech  is 
modeled  as  a  train  of  impulses  spaced  a  pitch  period  apart. 
Figure 5.9  The Linear Prediction model for speech

The book by Markel and Gray (64) is a detailed
exposition  of  LPC  as  applied  to  speech,  while  the  tutorial 
paper  by  Makhoul  (76)  discusses  the  use  of  LPC  as  a 
mathematical  tool  for  modeling. 

With reference to the model of Figure 5.9, we have that

$$s(n) = -\sum_{i=1}^{p} a_i\, s(n-i) + u(n)$$

where $s(n)$ is the speech signal, $a_i$, $i = 1, \ldots, p$ are
the VT filter coefficients and $u(n)$ is the input.

The $p$th order linear prediction of $s(n)$ is defined as

$$\hat{s}(n) = -\sum_{i=1}^{p} a_i\, s(n-i)$$

and the prediction error is defined as

$$e(n) = s(n) - \hat{s}(n) = u(n)$$

Now, define the average prediction error,

$$E = \sum_{n} e^2(n) \qquad (2)$$

where the range of the index is purposely left unspecified
at present.

The basic principle behind LPC is that the VT filter
coefficients are obtained by minimizing $E$ in equation (2)
with respect to the coefficients, $a_i$, $i = 1, \ldots, p$.
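
The minimization can be written out explicitly. The short derivation below is a standard result included only as a reminder; it uses the symbols already defined and, like equation (2), leaves the range of the summation over $n$ unspecified. Substituting the prediction error into (2) and setting each partial derivative to zero gives

$$E = \sum_{n} \Big[ s(n) + \sum_{i=1}^{p} a_i\, s(n-i) \Big]^2$$

$$\frac{\partial E}{\partial a_k} = 2 \sum_{n} \Big[ s(n) + \sum_{i=1}^{p} a_i\, s(n-i) \Big] s(n-k) = 0, \qquad k = 1, \ldots, p$$

$$\sum_{i=1}^{p} a_i \sum_{n} s(n-i)\, s(n-k) = -\sum_{n} s(n)\, s(n-k), \qquad k = 1, \ldots, p$$

This is a set of $p$ linear equations in the $p$ unknown coefficients. The choice of the summation range over $n$ determines the structure of the coefficient matrix.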

Depending on the range of summation used in (2), two
different methods of LPC emerge: the autocorrelation and
the covariance method. We shall briefly summarize the
properties of these two methods below and refer the
interested reader to (76) and the references cited earlier
for detailed derivations.

Autocorrelation Method: In this method, the range of
summation in (2) is theoretically infinite. Clearly, the
(always) finite data record length available in practice,
and the time varying nature of the VT filter, mean that this
infinite summation range has to be somehow modified. This
is accomplished by multiplying a finite length of the speech
data by a window that is zero outside a finite interval.
Consequently, only a finite number of terms in the infinite
summation are nonzero, and (2) can be evaluated.

The primary advantages of the autocorrelation method
are

i) it has a convenient interpretation in the spectral
domain,

ii) the VT filter

$$ V(z) = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}} $$

is guaranteed to be stable, and

iii) fast algorithms exist for solving the resultant
linear equations in a_i, i=1,...,p.

All the disadvantages of the autocorrelation method
are due to the windowing: (i) the analysis window needs to
be at least 100-300 points long and (ii) the formant
frequencies and bandwidths estimated may be in error due to
the convolution of the window spectrum with the spectrum of
the speech signal.
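
The autocorrelation method summarized above can be sketched in a few lines. This is a minimal illustration, not the program used in this study: the function name `autocorr_lpc` is hypothetical, a Hamming window is assumed, and the Levinson-Durbin recursion stands in for the "fast algorithms" mentioned in the text. The sign convention follows A(z) = 1 + sum a_i z^-i.

```python
import numpy as np

def autocorr_lpc(frame, p):
    """Autocorrelation-method LPC (illustrative sketch).

    Returns (a, err) where a = [1, a_1, ..., a_p] are the coefficients of
    A(z) = 1 + sum_i a_i z^-i and err is the residual energy.
    """
    # Window the frame so that the signal is zero outside a finite interval.
    x = np.asarray(frame, dtype=float) * np.hamming(len(frame))
    n = len(x)
    # Autocorrelation lags r(0), ..., r(p) of the windowed data.
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(p + 1)])

    a = np.zeros(p + 1)
    a[0] = 1.0
    err = r[0]
    for m in range(1, p + 1):
        # Reflection coefficient for order m.
        acc = r[m] + np.dot(a[1:m], r[m - 1:0:-1])
        k = -acc / err
        prev = a.copy()
        # Order update of the predictor coefficients.
        a[1:m] = prev[1:m] + k * prev[m - 1:0:-1]
        a[m] = k
        err *= 1.0 - k * k
    return a, err
```

Because the windowed autocorrelation matrix is Toeplitz and positive definite, the roots of the resulting A(z) lie inside the unit circle, which is the stability property listed as advantage ii).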

Covariance Method: The covariance method uses only a finite
range of n for the summation in (2). It can be shown that
the method is equivalent to a least squares solution of the
system of equations

$$ s(n) + \sum_{i=1}^{p} a_i \, s(n-i) = 0, \qquad n = N_0, \ldots, N_0+N-1 $$

where N is the number of samples in the analysis
frame. Note that this set of equations is exactly the same
as those encountered in Chapter 4 when we considered the
closed phase covariance analysis. The important difference
is that in the covariance method, the above equation is
considered to be valid for all n in the analysis frame; in
the closed phase method, we carefully restricted n to a
subset of the samples in the closed phase.

The  primary  advantages  of  the  covariance  method  are 
that  (i)  the  analysis  length  N  can  be  small  and  consequently 
(ii)  no  windowing  of  the  speech  signal  is  involved. 


The disadvantages are that (i) the resulting VT filter,

$$ V(z) = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}} $$

is not guaranteed to be stable, (ii) the method has no
convenient interpretation in terms of spectral matching and
(iii) the solution of the linear equations in the a_i
requires slightly more computation than the autocorrelation
method.
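
The least squares view of the covariance method can be illustrated as follows. This is a sketch under stated assumptions: the name `covariance_lpc` and its arguments are hypothetical, the p samples preceding the frame are assumed to be available as history, and a general least squares solver is used in place of whatever solver was used in this study.

```python
import numpy as np

def covariance_lpc(s, start, length, p):
    """Covariance-method LPC as a least squares problem (illustrative sketch).

    Solves s(n) + sum_i a_i s(n-i) = 0 in the least squares sense for
    n = start, ..., start+length-1 and returns [1, a_1, ..., a_p].
    """
    s = np.asarray(s, dtype=float)
    if start < p:
        raise ValueError("need p samples of history before the frame")
    n = np.arange(start, start + length)
    # Column i-1 holds the delayed samples s(n-i); no window is applied.
    past = np.column_stack([s[n - i] for i in range(1, p + 1)])
    a, *_ = np.linalg.lstsq(past, -s[n], rcond=None)
    return np.concatenate(([1.0], a))
```

When the equations hold exactly over the frame, as for the free response of an all-pole filter, the coefficients are recovered exactly even from a short frame. Nothing in the solution constrains the roots of A(z), which is why stability is not guaranteed.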

Given the speech signal for an utterance, the usual
procedure for applying either of the LPC methods described
above is as follows:

1) A fixed analysis window (or frame) is chosen. This
is usually between 100 and 300 samples for a 10 kHz sampling
rate.

2) A fixed filter order, p, is chosen. This is
typically in the range 12-16.

3) The analysis is performed over the analysis window
and the coefficients a_i, i=1,...,p are computed.

4) The analysis window is then shifted to include new
speech samples; typically, there is an overlap of samples
between successive windows. This overlap is fixed and
usually between 100 and 200 samples. If N is the window
length and N_0 here denotes the number of samples
overlapped, then a new set of filter coefficients is
available every N-N_0 samples. This controls the analysis
frame rate, f_r.
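
The relation between window length, overlap, and frame rate in step 4 is simple arithmetic, sketched below. The function names are hypothetical; the only assumption is that the frame rate equals the sampling rate divided by the shift N-N_0.

```python
def frame_starts(num_samples, window, overlap):
    """Start indices of successive fixed-length analysis windows."""
    hop = window - overlap          # new samples per analysis, N - N_0
    if hop <= 0:
        raise ValueError("overlap must be smaller than the window length")
    return list(range(0, num_samples - window + 1, hop))


def frame_rate(fs, window, overlap):
    """Analyses per second for a fixed window and overlap."""
    return fs / (window - overlap)
```

For example, a 200 sample window with a 100 sample overlap at a 10 kHz sampling rate advances 100 samples per analysis, giving 100 analyses per second.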

The procedure for LPC described above operates with a
fixed window length, N, and a fixed frame rate, f_r. Such a
scheme is termed pitch-asynchronous analysis (64). The EGG,
as we have emphasized earlier, allows individual pitch
periods of voiced speech to be isolated. In such a case,
the analysis window can be placed inside an individual pitch
period. Such an analysis scheme is termed pitch-synchronous
(64). We now describe three possible pitch-synchronous LPC
techniques:

1)  Pitch-synchronous  covariance:  If  the  analysis 
window  is  made  to  coincide  with  a  single  pitch  period  and 
the  covariance  method  of  LPC  is  used,  we  call  the  analysis 
pitch-synchronous  covariance  LPC. 

2)  Pitch-synchronous  circular  autocorrelation:  The 
autocorrelation  method  as  described  earlier  is  not  directly 
applicable  in  a  pitch-synchronous  scheme.  This  is  because 
the  pitch  period  can  often  be  fewer  than  50  samples  long  (at 
a  10  KHz  sampling  rate).  Consequently,  the  windowing  used 
in  the  autocorrelation  method  can  have  a  deleterious  effect 
on  the  analysis  results.  There  is,  however,  one  other 
alternative:  a  single  pitch  period  is  considered  a  period 
from  a  completely  periodic  signal.  In  such  a  case,  the 
periodic  signal  is  known  for  all  time.  Furthermore,  the 
assumption  of  periodicity  means  that  the  autocorrelation  and 
the  covariance  method  are  the  same.  Thus,  the  method  seems 
to  incorporate  all  the  advantages  of  the  autocorrelation 
method  with  none  of  its  disadvantages.  The  method  is 
presented  in  (77,78).  However,  since  the  technique  is  not 
well  known,  we  derive  all  the  relevant  properties  of  the 
method  in  Appendix  D. 

3) Closed phase covariance: A detailed analysis of
the  closed  phase  method  has  already  been  presented  in 
Chapter  4.  Recall  that  the  method  is  basically  the 
covariance  method  applied  to  the  segment  of  the  speech 
signal  in  the  closed  glottal  phase.  Theoretically,  this  is 
the  best  of  the  methods  presented  so  far  since  it  is  the 
only  one  that  explicitly  recognizes  the  source-tract 
interaction  effect. 
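
The circular autocorrelation idea in technique 2) can be sketched as follows. This is an illustration only and is not the derivation of Appendix D: the function names are hypothetical, a single pitch period is treated as one period of a perfectly periodic signal, and the normal equations are solved directly rather than by a fast recursion. The filter order p is assumed to be smaller than the number of samples in the period.

```python
import numpy as np

def circular_autocorr(period, max_lag):
    """Circular autocorrelation r(k) = sum_n x(n) x((n+k) mod N), k = 0..max_lag."""
    x = np.asarray(period, dtype=float)
    # np.roll(x, -k)[n] equals x[(n + k) mod N], so no window is needed.
    return np.array([np.dot(x, np.roll(x, -k)) for k in range(max_lag + 1)])


def circular_lpc(period, p):
    """LPC coefficients [1, a_1, ..., a_p] from one pitch period (sketch)."""
    r = circular_autocorr(period, p)
    # Under the periodicity assumption the covariance matrix is Toeplitz.
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))
```

Since the period is extended periodically rather than truncated, the lag products wrap around instead of running off the end of the data, which is how the method avoids the windowing step.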

Note that in all the pitch-synchronous LPC schemes
outlined above, the frame rate, f_r, is variable and tied to
the pitch period. The analysis window size, N, is also
variable  and  depends  on  the  pitch  period.  Furthermore, 
there  is  no  overlap  between  adjacent  analysis  windows. 
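
The variable, non-overlapping framing described above can be illustrated with a short sketch. It assumes that glottal closing and opening instants are already available as sample indices (for instance from the EGG); the function names and the half-open (start, end) convention are hypothetical choices made for this example.

```python
def pitch_synchronous_frames(closure_instants):
    """One frame per pitch period: (start, end) pairs, end exclusive."""
    return list(zip(closure_instants[:-1], closure_instants[1:]))


def closed_phase_frames(closure_instants, opening_instants):
    """One frame per closed phase: from each closure to the next opening."""
    frames = []
    for c in closure_instants:
        later = [o for o in opening_instants if o > c]
        if later:
            frames.append((c, min(later)))
    return frames


def local_frame_rates(frames, fs):
    """Frame rate implied by each frame length, in analyses per second."""
    return [fs / (end - start) for start, end in frames]
```

Adjacent frames share only a boundary instant, so there is no overlap, and both the frame length and the frame rate follow the pitch period.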

Whichever LPC method is adopted, a set of coefficients
a_i, i=1,...,p is obtained for each analysis window. These
coefficients model the VT filter, V(z), as

$$ V(z) = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}} $$

The roots of the polynomial

$$ A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i} $$

generally contain the roots corresponding to the resonances
of the vocal tract as a subset (64). The additional roots
usually correspond to resonances with much higher bandwidth
than is associated with VT resonances. Thus, by solving the
polynomial A(z) for the roots and eliminating those with a
large bandwidth, we can estimate the formant frequencies and
bandwidths of the VT system.

For  most  speech  analysis  studies,  4-5  formants  are 
used.  The  variation  of  each  of  the  formants  with  time  is 
called  the  formant  contour  of  the  utterance. 

Discussion.  It  is  appropriate  to  pause  at  this  point 
and  take  stock  of  all  that  we  have  discussed,  and  outline 
where  we  are  heading.  To  begin  with,  two  basic  forms  of 
pitch-asynchronous  linear  prediction  analysis  were 
presented:  the  autocorrelation  and  the  covariance  method. 
Both these pitch-asynchronous schemes operate at a fixed
frame rate and with a fixed window length. Both ignore the
quasi-periodic nature of the voiced speech. The fine
segmentation of the speech signal provided by the EGG leads
to the pitch-synchronous LPC methods, namely the
pitch-synchronous covariance, the pitch-synchronous circular
correlation, and the pseudo closed phase methods.

Clearly, if the pitch-asynchronous methods performed
adequately, there would be no need to consider the
pitch-synchronous methods. Let us therefore list the sources
of error in pitch-asynchronous LPC (6):

(1) Errors due to voice periodicity. The performance
of the pitch-asynchronous LPC deteriorates rapidly with
increasing voice fundamental frequency (6,79).
Consequently, the LPC analysis cannot be applied to certain
female and children's voices (79). An extensive simulation
study by Gish (6) has shown that this failure with high
pitched voices is also carried over to pitch-synchronous
covariance analysis. The theoretical considerations behind
the pseudo-closed phase method presented in Chapter 4 show
that this method is immune to errors due to voice pitch so
long as there are an adequate number of samples in the
closed phase. The simulation results of Gish also confirm
this fact (6).

(2)  Errors  due  to  glottal  v-v  shape.  Certain  voice 
types  where  a  harmonic  of  the  voice  pitch  is  close  to  the 
first  formant  frequency  lead  to  errors  in  pitch-asynchronous 
analysis.  What  happens  is  that  the  single  peak  due  to  the 
first  formant  is  often  split  into  two  peaks.  Since  this 
phenomenon  is  due  to  the  fact  that  more  than  one  pitch 
period  is  contained  in  the  analysis  window,  it  is  expected 
that  any  of  the  pitch-synchronous  analysis  schemes  will 
solve  the  problem. 

(3) Errors due to source-tract coupling. This has
been discussed in Chapter 4; only the closed phase method
solves this problem by restricting the analysis to the
closed glottal segments.

(4)  Errors  in  formant  tracking.  The  formant 
trajectories  estimated  by  pitch-asynchronous  LPC  often  show 
sudden  jumps  in  the  contour  that  are  incorrect.  The  fixed 
frame  rate  also  means  that  some  fast  formant  transitions 
cannot be tracked. To quote Markel and Gray,

    In performing pitch-asynchronous analysis it is
    important to filter the (formant) trajectories, as they do
    indicate some amount of random behaviour about a smooth
    curvature as would be expected from physiological
    constraints. The cause of the slight irregularities is the
    fixed frame analysis rate and window conditions. Several
    periods may be placed at varying locations within any
    window. Pitch-synchronous analysis may resolve this
    problem. Unfortunately, correct pitch-synchronous analysis
    is much more difficult to perform automatically. (64, page
    180)

Thus,  we  have  the  following  situation:  Closed  phase 
analysis  can  theoretically  solve  all  four  problems  mentioned 
above.  This  has,  however,  never  been  verified  with  real 
speech  because  an  automatic  method  of  isolating  the  closed 
phase  was  not  available.  We  speculate  that  all  pitch- 
synchronous  analysis  schemes  can  solve  problems  (2)  and  (4) 
above.  But  this  too  has  not  been  verified  because  of  the 
difficulty  of  precisely  isolating  individual  pitch  periods. 

The  EGG,  as  we  have  repeatedly  stressed,  provides  an 
automatic  method  for  isolating  both  individual  glottal 
periods  and  the  closed  phase.  We  are,  therefore,  in  a 
unique  position  to  compare  the  different  methods  of  LPC  and 
test  the  conjectures  mentioned  above. 

Comparison of the LPC Methods. Computer programs have
been implemented for four of the LPC methods:
pitch-asynchronous autocorrelation, pitch-synchronous
covariance, pitch-synchronous circular correlation and
pseudo closed phase. The pitch-asynchronous covariance
method was not implemented since the autocorrelation and
covariance methods are known to perform similarly when
applied pitch-asynchronously with long window lengths
(53,64).

The analysis conditions for the different methods are
as follows (the sampling rate is 10 kHz in all cases):

(1) Pitch-asynchronous autocorrelation (PAA):
Frame length = 200 samples.
Overlap = 100 samples; frame rate = 100 analyses/sec.
Hamming window.
Preemphasis with the filter 1 - 0.95 z^{-1}.
Analysis filter order, p = 14.

(2) Pitch-synchronous covariance (PSC):
Frame length is variable and dependent on the pitch period.
No overlap; frame rate is pitch dependent.
No window.
Preemphasis with the filter 1 - 0.95 z^{-1}.
Analysis filter order, p = 14.

(3) Pitch-synchronous circular correlation (PSA):
Frame length is variable and equal to 1 pitch period.
No overlap; frame rate is pitch dependent.
No window.
Preemphasis with the filter 1 - 0.95 z^{-1}.
Analysis filter order, p = 14.

(4) Pseudo closed phase (PCP):
Frame length is variable and equal to the EGG determined
closed phase.
No overlap; frame rate is pitch dependent.
No window.
Preemphasis with the filter 1 - 0.95 z^{-1}.
Analysis filter order, p = 12.

The roots of the polynomial

$$ A(z) = 1 + \sum_{i=1}^{p} a_i z^{-i} $$

are determined for each analysis in each of the methods.
All roots with a bandwidth larger than 500 Hz are
eliminated. The remaining roots are retained as potential
formant roots, and constitute the raw formant data. No
other form of formant trajectory smoothing or filtering is
done.
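
The root-solving step above can be sketched as follows. The 500 Hz threshold is taken from the text; everything else is an assumption made for illustration: the function name `formants_from_lpc` is hypothetical, and the usual conversion from a z-plane root to a frequency and a 3 dB bandwidth (F = angle * fs / 2 pi, B = -ln|z| * fs / pi) is assumed to be the one intended.

```python
import numpy as np

def formants_from_lpc(a, fs, max_bw=500.0):
    """Candidate formant frequencies and bandwidths (Hz) from A(z).

    a is [1, a_1, ..., a_p] for A(z) = 1 + sum_i a_i z^-i.
    """
    roots = np.roots(a)
    # Keep one root of each complex-conjugate pair; real roots carry no formant.
    roots = roots[np.imag(roots) > 0]
    freqs = np.angle(roots) * fs / (2.0 * np.pi)
    bws = -np.log(np.abs(roots)) * fs / np.pi
    # Eliminate roots whose bandwidth exceeds the threshold. Roots outside
    # the unit circle give negative bandwidths and are left as they are here;
    # how such roots were treated is not stated in the text.
    keep = bws < max_bw
    order = np.argsort(freqs[keep])
    return freqs[keep][order], bws[keep][order]
```

Applied to every analysis frame, the retained frequencies form the raw formant data; plotting them against the frame time gives unsmoothed formant contours.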

Results

Three  utterances  were  analyzed  by  the  above  four  LPC 
methods.  The  first  utterance  was  a  steady  phonation  of  the 
vowel  /a/  by  a  male  subject,  RAVI.  The  average  pitch  period 
for  the  utterance  was  4.9ms.  The  formant  trajectories  of 
the  first  four  formants  estimated  by  the  PCP,  PAA,  PSC,  and 
PSA  methods  of  LPC  are  shown  in  Figures  5.10,  5.11,  5.12, 
and  5.13,  respectively. 

An  examination  of  these  figures  shows  that  the  pseudo 
closed  phase  and  pitch-synchronous  covariance  methods  give 
the  best  results  for  the  formant  trajectories  in  the  sense 
that  there  are  very  few  sudden  jumps  in  the  contours.  The 
pitch-synchronous  circular  correlation  method  has  numerous 
errors in the 3rd and 4th formant estimates. The
pitch-asynchronous autocorrelation method includes an extra
resonance at about 900 Hz. A plot of the FFT spectrum of a
300 sample Hamming windowed section of the utterance is
shown superposed on the pitch-asynchronous autocorrelation
LPC spectrum in Figure 5.14. This clearly shows that the
source of the extra formant is the interaction between the
pitch harmonic and formant frequency alluded to earlier.
None of the pitch-synchronous methods show this extra
formant.

Figure 5.13 Pitch synchronous circular correlation method.

Figures 5.15, 5.16, 5.17, and 5.18 are the formant
contours for the utterance "We were away" spoken by a male
speaker, RAVI, for the PCP, PAA, PSC, and PSA methods,
respectively. The corresponding pitch contour is shown in
Figure 5.4.

The formant contours estimated by the PCP, PAA, PSC,
and PSA methods of LPC are shown in Figures 5.19, 5.20,
5.21, and 5.22, respectively, for the utterance "Should we
chase" spoken by a male speaker, RAVI. The pitch contour
for this utterance is shown in Figure 5.5.

An examination of these figures shows that the formant
trajectories estimated by the pseudo closed phase method are
the best in the sense of having an extremely smooth
variation with very few jumps. If we exclude the region
corresponding to the /w/ in "were" (time 0.55-0.65 s) in
Figure 5.15, we see that the PCP method has made very few
errors in terms of missed formants or extra, non-formant
peaks. The same remarks also apply to Figure 5.19 if we
exclude the region corresponding to the plosive /d/ in
"should" (time 0.53-0.58 s).

Figure 5.14 Pitch asynchronous autocorrelation spectrum and
100 point Hamming windowed FFT spectrum of steady vowel /a/.
The broad peak around 300 Hz corresponds to 2 poles at
750 Hz and 900 Hz.

Figure 5.15 Pseudo closed phase method.

Figure 5.16 Pitch asynchronous autocorrelation method.

Figure 5.17 Pitch synchronous covariance method.

Figure 5.18 Pitch synchronous circular correlation method.

Figure 5.19 Pseudo closed phase method.

Figure 5.20 Pitch asynchronous autocorrelation method.
(Plot of raw formant data for the PAA method, subject RAVI,
utterance "Should we chase", against time in seconds.)

Figure 5.21 Pitch synchronous covariance method.

The  formant  contours  for  the  PAA  method  for  the  two 
utterances  are  shown  in  Figures  5.16  and  5.20.  We  see  in 
Figure  5.16  that  this  method  has  been  unable  to  follow  the 
fast  transitions  in  the  second  and  third  formants.  Many 
instances  of  missed  formants  or  extra  "formants"  are  also 
evident.  In  Figure  5.20,  such  errors  are  again  found. 
Furthermore,  here  it  is  very  difficult  to  trace  the  contour 
for  the  fourth  formant  which  seems  to  be  missing  most  of  the 
time. 

The  results  for  the  PSC  method  are  shown  in  Figures 
5.17  and  5.21.  The  method  performs  better  than  the  PAA 
technique,  but  problems  due  to  extra  or  missed  formants 
persist.  Notice  in  particular  the  poor  performance  in 
tracking  the  first  formant  in  both  these  figures. 

The  performance  of  the  PSA  method,  the  results  for 
which  are  shown  in  Figures  5.18  and  5.22,  is  the  worst  among 
all  the  techniques.  While  the  formant  contour  has  been 
sampled  at  a  higher  rate  than  for  the  PAA  method,  the  errors 
in  the  method  are  also  more  frequent. 

Automatic Estimation of the Glottal Volume Velocity

We have already emphasized the fact that only the
pseudo closed phase method is capable of an accurate
estimation of the glottal v-v. The "glottal v-v" obtained
by the other methods will not have the ripple that
corresponds to the "predistortion" of the source waveform to
account for the source-tract coupling effects. See Chapter
4 for a further clarification of this point.

In  discussing  glottal  inverse  filtering  in  Chapter  4, 
we  emphasized  the  highly  interactive  nature  of  the  present 
techniques.  This  limits  the  nature  of  the  utterances  that 
can  be  inverse  filtered.  Automatic  inverse  filtering 
schemes  are  a  must  if  the  dynamics  of  the  glottal  sound  are 
to  be  studied.  We  concluded  Chapter  4  with  the 
recommendation  that  the  PCP  method  be  used  as  a  tool  for 
automatic  inverse  filtering. 

Our  results  in  the  previous  section  have  shown  that  the 
PCP  method  is  capable  of  excellent  formant  tracking 
properties.  However,  the  requirements  for  inverse  filtering 
are  more  stringent  since  the  formant  frequencies  and 
bandwidths  need  to  be  accurate. 

Figure  5.23  is  the  differentiated  v-v  estimated  by  the 
PCP  method  for  the  utterance  "We  were  away."  The  inverse 
filtering  was  done  automatically,  in  a  pitch-synchronous 
manner,  utilizing  the  first  five  formant  frequencies  and 
bandwidths  estimated  for  the  period.  Note  that  except  for  a 
few  "patches"  where  the  method  has  failed,  the  inverse 
filtered  waveform  agrees  with  all  the  expected 
characteristics  for  the  signal.  The  presence  of  a  "flat" 
closed  phase,  ripple  due  to  the  source  tract  interaction  and 
the  sharp  slope  at  closure  imply  that  the  waveform  fits  all 
the  "requirements"  of  a  differentiated  v-v  waveform.  While 
it is just a single example, it is nevertheless a
demonstration of the fact that automatic inverse filtering
using the EGG is possible.

Figure 5.23 Plot of Speech and Differentiated Glottal
Volume Velocity Signal for the Utterance "We were away."
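
The inverse filtering step can be sketched as follows. This is a simplified illustration, not the program used to produce Figure 5.23: the function names are hypothetical, each formant is assumed to map to one second-order section of A(z), and the speech is simply passed through A(z) as an FIR filter. Updating the filter once per pitch period, as described in the text, is left out.

```python
import numpy as np

def inverse_filter_from_formants(freqs, bws, fs):
    """Build A(z) = [1, a_1, ..., a_2M] from M formant (frequency, bandwidth) pairs."""
    a = np.array([1.0])
    for f, b in zip(freqs, bws):
        r = np.exp(-np.pi * b / fs)        # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs       # pole angle from frequency
        # Each formant contributes a conjugate pair of zeros to A(z).
        a = np.convolve(a, [1.0, -2.0 * r * np.cos(theta), r * r])
    return a


def inverse_filter(speech, a):
    """Apply A(z) to the speech signal (zero initial conditions)."""
    speech = np.asarray(speech, dtype=float)
    return np.convolve(speech, a)[:len(speech)]
```

If the speech were exactly the output of 1/A(z), this operation would return the excitation; with real speech the quality of the result depends directly on the accuracy of the formant frequencies and bandwidths, which is the stringent requirement noted above.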

Discussion. Our results have clearly demonstrated the
superiority  of  the  closed  phase  method  of  analysis.  This 
method  has  not  only  superior  formant  tracking  properties, 
but  the  formant  frequencies  and  bandwidths  are  also 
accurate.  This  last  claim  is  based  on  the  "reasonable" 
inverse  filtered  waveforms  obtained  by  using  the  VT  filters 
estimated  by  the  PCP  method. 

The  pitch-synchronous  covariance  analysis  method 
provides  better  formant  tracking  than  a  pitch-asynchronous 
scheme. However, smoothing the formant contours is still
needed; in particular, the first formant values are often in
error.

The  results  in  this  chapter  also  show  that  the  effects 
of  source-tract  coupling  are  indeed  significant.  The 
opinion  in  the  speech  analysis  community  that  this 
phenomenon  does  not  significantly  affect  linear  prediction 
analysis  techniques  is  clearly  unfounded.  The  far  superior 
performance  of  the  closed  phase  method  over  the  pitch- 
synchronous  covariance  method  is  clear  evidence  of  this 
fact. 

The EGG is a valuable supplement in speech analysis. A
two channel speech analysis approach, with the EGG providing
the "fine" segmentation of the speech waveform, opens up the
door to accurate and reliable formant and glottal volume
velocity estimation. This has far reaching implications in
many areas: articulatory synthesis, voice source dynamics,
high quality speech synthesis, visual training aids for the
deaf, etc.


Note 


The EGG signal reflects glottal activity, while the
speech signal is measured at the mouth. Hence, in
simultaneously collected EGG and speech signals, the speech
signal lags the EGG by an interval corresponding to the
acoustic propagation delay from the glottis to the lips.
This delay is typically between 0.6 and 0.8 ms. Thus, to
synchronize the two signals, the EGG needs to be delayed
by this interval of time. A visual comparison of the EGG
and the speech signal typically allows this
synchronization to be achieved to within 0.3 ms. It is
assumed from now on in this chapter that such a
synchronization has been achieved.
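
In sampled form, the synchronization described in this note amounts to shifting the EGG by a whole number of samples. The sketch below is an illustration only: the function name `delay_egg` is hypothetical, the delay is rounded to the nearest sample, and the vacated samples at the start are filled with zeros.

```python
import numpy as np

def delay_egg(egg, delay_ms, fs):
    """Delay the EGG by delay_ms milliseconds, keeping the original length."""
    egg = np.asarray(egg, dtype=float)
    d = int(round(delay_ms * 1e-3 * fs))   # delay in samples
    d = max(0, min(d, len(egg)))
    if d == 0:
        return egg.copy()
    # Shift right by d samples; zeros fill the first d positions.
    return np.concatenate((np.zeros(d), egg[:-d]))
```

At a 10 kHz sampling rate the 0.6 to 0.8 ms delay corresponds to 6 to 8 samples, and the 0.3 ms alignment accuracy to about 3 samples.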


CHAPTER  6 
CONCLUSIONS 

Summary 


The primary purpose of the research reported in this
dissertation was to improve our knowledge of
electroglottography and the relationship between the EGG and
vocal fold vibration. To this end, the EGG was compared
with  data  obtained  from  synchronized  ultra-high  speed 
laryngeal  films  and  the  acoustic  speech  signal.  The  data 
base  for  this  comparison  consisted  of  a  controlled  set  of  36 
experiments  in  which  the  fundamental  frequency  and  the 
intensity  of  phonation  were  systematically  varied.  Four 
normal,  male  adults  participated  in  the  experiments.  Based 
on  the  results  of  this  comparison  we  can  conclude  that: 

i)  The  EGG  signal  is  indicative  of  the  lateral  area  of 
contact  between  the  vocal  folds.  This  lateral  area  of 
contact  cannot  be  measured  from  the  high  speed  laryngeal 
films,  and  so  direct  evidence  for  this  statement  is 
unavailable.  However,  we  have  shown  that,  based  on  our 
knowledge  of  the  structure  and  vibratory  behaviour  of  the 
vocal  folds,  we  can  explain  the  characteristics  of  the  EGG 
observed  during  the  different  glottal  phases  in  terms  of  the 
lateral  area  of  contact  between  the  vocal  folds. 
Comparisons of the EGG with the glottal area and the length
of the glottal opening also support our hypothesis. There
is, however, a cautionary note to be added: the free mucus
on the superior surface of the vocal folds appears to
provide a low impedance path for the radio frequency current
used in the electroglottograph. Consequently, in
interpreting the EGG as a function of the lateral area of
contact between the vocal folds, it is necessary to include
this free mucus as part of the vocal folds.

ii)  The  EGG  is  an  excellent  indicator  of  the  vibratory 
period  of  the  vocal  folds. 

iii)  The  maximum  in  the  differentiated  EGG  occurs  very 
close  to  the  instant  when  the  vocal  folds  first  open  on  the 
superior  surface;  the  minimum  in  the  differentiated  EGG 
coincides  with  the  closing  instant  (i.e.,  the  instant  when 
the  projected  glottal  area  becomes  zero). 

iv)  The  identification  of  the  open  and  closed  phases 
from  the  EGG  is  useful  in  verifying  results  obtained  by 
inverse filtering the speech signal. Thus, the EGG is
useful in predicting the temporal characteristics of the
glottal v-v.

v)  The  EGG  appears  to  have  increasing  high  frequency 
energy  with  increasing  vocal  intensity.  However,  the 
spectral  energy  in  the  glottal  v-v  does  not  change 
consistently  with  changes  in  the  EGG  spectral  energy. 
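
Conclusions ii) and iii) can be illustrated with a small sketch. It is not the algorithm evaluated earlier in this dissertation: the function names are hypothetical, the EGG polarity is assumed to be the one implied by conclusion iii) (maximum of the differentiated EGG near opening, minimum at closing), and the input is assumed to be a single, already isolated glottal period.

```python
import numpy as np

def egg_instants_in_period(egg_period):
    """Return (opening_index, closing_index) within one EGG period."""
    # First difference as a simple stand-in for differentiation.
    degg = np.diff(np.asarray(egg_period, dtype=float))
    opening = int(np.argmax(degg))   # maximum of the differentiated EGG
    closing = int(np.argmin(degg))   # minimum of the differentiated EGG
    return opening, closing


def f0_from_closures(closure_instants, fs):
    """Period-by-period fundamental frequency (Hz) from closing instants."""
    periods = np.diff(np.asarray(closure_instants, dtype=float))
    return fs / periods
```

For instance, closing instants spaced 49 samples apart at a 10 kHz sampling rate correspond to a 4.9 ms period, or roughly 204 Hz.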

We also considered some possible applications of the
EGG to problems in speech analysis. Specifically,

i) A method of voiced/unvoiced classification and Fo
estimation using the EGG was described. The method
estimates the Fo on a period to period basis and is suitable
for real time implementation.

ii)  Three  pitch-synchronous  linear  prediction  analysis 
techniques  were  presented,  namely,  the  pitch-synchronous 
covariance  (PSC)  method,  the  pitch-synchronous  circular 
autocorrelation  (PSA)  method,  and  the  pseudo  closed  phase 
(PCP)  method.  The  PSC  and  the  PSA  methods  use  the  EGG  to 
segment  the  speech  waveform  into  individual  pitch  periods. 
The  PCP  method  further  divides  the  speech  waveform  in  each 
period into closed and open phases using the EGG. The three
pitch-synchronous methods and the pitch-asynchronous
autocorrelation linear prediction method were used to
analyze two sentences. A comparison of the formant tracking
ability  of  the  four  methods  revealed  the  superior 
performance  of  the  PCP  method.  This  experiment  clearly 
demonstrates  the  importance  of  including  the  source-tract 
interaction  effects  in  any  speech  analysis  scheme. 

An automatic inverse filtering scheme using the PCP
method was presented. The glottal volume velocity was
obtained for a sentence length utterance using this method.

Directions for Future Research

A number of problems for future research were
identified during the course of this study. These can be
conveniently divided into two categories: investigations
into the basic nature of the EGG and applications of the EGG
in speech analysis.

Studies of the EGG

An  important  continuation  of  the  research  reported 
here  is  to  conduct  a  similar  study  using  different  voice 
types;  e.g.,  female  and  children's  voices,  voices  with 
laryngeal  disorders,  etc.  This  has  to  be  coupled  with 
efforts  to  mathematically  model  the  EGG  using  both 
physiologically  based  models  such  as  those  of  Titze  (27) 
and parametric models (48). The specificational power of
these models has to be verified by comparison with
experimental data.

The  voice  periodicity  and  perturbations  of  this 
periodicity  are  known  to  be  good  indicators  of  various 
laryngeal  pathologies  (15).  Smith  (15)  has  used  periodicity 
measures  obtained  from  the  EGG  to  discriminate  between 
normal  and  pathologic  voice  with  good  success.  Further 
studies  are  needed  to  extend  his  work. 

The  spectral  characteristics  of  the  EGG  and  the  glottal 
v-v  are  believed  to  be  related  via  the  glottal  closure 
phenomenon.  This  has  prompted  some  investigators  to  use  the 
EGG  as  a  training  aid  in  singing  pedagogy.  This  study  did 
not  find  any  consistent  relationship  between  the  spectral 
characteristics of the EGG and the glottal v-v. Comparisons
of singers' and non-singers' voices should be done to clarify
this point.

Applications of the EGG

We have already seen that the EGG can be used for
accurate, reliable and real time estimation of the voice
fundamental frequency, Fo. This can be utilized in

i) speech training aids for the deaf to learn proper
intonation patterns (75),

ii) speech training aids in learning the intonation
patterns of foreign languages (75), and

iii) the study of intonation patterns in order to
deduce rules that allow the generation of stylized pitch
contours for text-to-speech synthesis (73).

We  have  also  studied  the  application  of  the  periodicity 
information  provided  by  the  EGG  in  implementing  pitch- 
synchronous  linear  prediction  analysis.  A  similar  use  of 
the  EGG  is  possible  in  speech  coding  techniques  such  as 
direct  sample  interpolation  (80)  and  time  domain  harmonic 
scaling  (81). 

Finally,  the  automatic  inverse  filtering  scheme 
described  in  Chapter  5  needs  to  be  improved.  The  author's 
opinion  is  that  the  incorporation  of  formant  tracking 
algorithms  and  an  accurate  closed  phase  analysis  method  will 
lead  to  a  robust  automatic  inverse  filtering  scheme  that  can 
be  used  with  a  wide  range  of  speakers  and  utterances.  This 
ability  to  separate  the  source  and  the  tract  is  of  potential 
benefit  in  almost  all  problems  in  speech  science. 


APPENDIX  A 
CALIBRATION  OF  THE  ELECTROGLOTTOGRAPH  CIRCUITS 


D. Teaney (12) has suggested a method for adjusting the electroglottograph circuits to ensure that the device does not distort the waveforms in the frequency range of interest. The principle of this method is illustrated in Figure A.1. A square wave is used to control the switches SW1 and SW2, which switch the impedance seen by the electroglottograph electrodes between two values. The output of the electroglottograph should be a faithful reproduction of the input square wave. The time constants in the electroglottograph are adjusted to achieve accurate reproduction over a range of input square wave frequencies.

A circuit using analog switches has been built to calibrate the electroglottograph as outlined above. The input square wave and the electroglottograph output for three different input frequencies are shown in Figure A.2.

Note: K refers to the resistor value in kilohms.

Figure A.1 Circuit to test the electroglottograph.


[Figure A.2 panels: 10 Hz, 100 Hz and 1000 Hz input square waves. In each panel the top graph is the input square wave and the bottom graph is the electroglottograph output.]

Figure A.2 Square wave response of the electroglottograph.


APPENDIX  B 


TAPE  RECORDER  DISTORTION  CORRECTION 


Berouti's method for tape recorder distortion correction (19) is based on the assumption that the tape recording and playback process can be represented as a linear, time invariant system. Let H(w) be the frequency response of this system. Then, if S(w) is the Fourier transform of a recorded and played back signal, the original, undistorted signal has Fourier transform U(w) given by

U(w) = S(w) / H(w)

Practical experience has shown that only the phase distortion introduced by H(w) is important and the magnitude of H(w) can be ignored. Berouti, therefore, proposed the following procedure:

i) determine the phase response of 1/H(w),

ii) add this phase response to the phase function of S(w),

iii) take the inverse Fourier transform to obtain the undistorted signal u(t).

This  entire  procedure  is  accomplished  using  sampled 
signals  and  the  Fast  Fourier  Transform. 

To correct the tape recorded and digitized EGG, the corresponding traced EGG is used to determine the phase correction required; consequently, the correction parameters can be independently determined for each task. We show in Figure B.1 the traced EGG (TR), the tape recorded EGG (TP) and the corrected version of the tape recorded EGG (TPC).

The speech signal for all the tasks is corrected using a fixed phase correction. This correction was obtained by recording a 10 Hz square wave on tape and then digitizing this signal on playback. Since the original signal is known, the phase correction can be derived. We also show in Figure B.1 an original 100 Hz square wave (OD), the taped and digitized version of it (TP) and the square wave after tape correction (TPC).
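
The phase correction procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch and not Berouti's or the author's program: the function names (`fft`, `phase_correction`, `apply_correction`) are hypothetical, the signal length is assumed to be a power of two, and the correction is assumed to be estimated bin by bin from a known original signal and its taped version, as in the square wave case.

```python
import cmath
import math


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def phase_correction(original, recorded):
    """Step i): phase of 1/H(w) per FFT bin, from a known signal pair."""
    U = fft(original)
    S = fft(recorded)
    corr = []
    for u, s in zip(U, S):
        # H = S/U, so arg(1/H) = arg(U) - arg(S).
        # Assumption: bins with negligible energy are left uncorrected.
        if abs(u) > 1e-9 and abs(s) > 1e-9:
            corr.append(cmath.phase(u) - cmath.phase(s))
        else:
            corr.append(0.0)
    return corr


def apply_correction(recorded, corr):
    """Steps ii) and iii): add the phase to S(w), keep |S(w)|, then invert."""
    S = fft(recorded)
    corrected = [s * cmath.exp(1j * p) for s, p in zip(S, corr)]
    n = len(recorded)
    return [v.real / n for v in fft(corrected, inverse=True)]
```

In such a scheme `phase_correction` would be run once on the known square wave and its taped version, and the resulting phase would then be passed to `apply_correction` for every recorded signal of the same length.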

[Figure B.1 panel title: Tape correction of 100 Hz square wave]

Figure B.1 Illustration of tape correction.


APPENDIX C


FIR  BAND  PASS  FILTER  RESPONSE 


An  FIR  linear  phase  filter  has  been  utilized  at  a 
number  of  places  in  this  study.  The  351  tap  filter  was 
designed  using  the  windowed  Fourier  series  method  (66).  The 
magnitude  frequency  response  of  the  filter  is  shown  in 
Figure C.1.

Figure C.1 Magnitude frequency response of FIR linear phase filter.
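
A filter of this kind can be designed as sketched below. The sketch is illustrative only: it assumes a Hamming window, and the function names `bandpass_fir` and `magnitude` are hypothetical. The window and the band edges of the filter actually used in this study are not stated in this appendix, so any band edges and sampling rate passed to the function are the reader's own choice.

```python
import cmath
import math


def bandpass_fir(num_taps, f_low, f_high, fs):
    """Linear phase band-pass FIR by the windowed Fourier series method.

    The ideal band-pass impulse response is truncated to num_taps samples,
    centered, and multiplied by a Hamming window (an assumption).
    """
    w1 = 2.0 * math.pi * f_low / fs
    w2 = 2.0 * math.pi * f_high / fs
    mid = (num_taps - 1) / 2.0
    taps = []
    for n in range(num_taps):
        m = n - mid
        if m == 0:
            ideal = (w2 - w1) / math.pi
        else:
            ideal = (math.sin(w2 * m) - math.sin(w1 * m)) / (math.pi * m)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    return taps


def magnitude(taps, f, fs):
    """Magnitude of the filter frequency response at frequency f (Hz)."""
    w = 2.0 * math.pi * f / fs
    return abs(sum(h * cmath.exp(-1j * w * n) for n, h in enumerate(taps)))
```

For example, with hypothetical band edges of 300 Hz and 3000 Hz at a 10 kHz sampling rate, `bandpass_fir(351, 300.0, 3000.0, 10000.0)` returns 351 taps that are symmetric about the center tap, which is what gives the filter its linear phase.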


APPENDIX  D 


THE  PITCH  SYNCHRONOUS  CIRCULAR  CORRELATION  (PSA)  METHOD 


Let s(n) be the sampled speech signal, and let s(n), n = 0, 1, ..., N-1, be one pitch period of the signal. Define a new signal u(n) as

$$u(n) = s(k), \qquad n = -\infty, \ldots, 0, 1, \ldots$$

where k = n mod N is the remainder on the division of n by N.

Then  u(n)  is  a  strictly  periodic  signal  of  period  N. 

In  the  PSA  method,  the  time-varying  model  for  the 
speech  signal  s(n)  is  obtained  as  the  all  pole  linear 
prediction  model  of  the  periodically  extended  signal  u(n) 
for  each  pitch  period.  Thus,  the  signal  model  is  updated 
once  per  pitch  period  in  the  PSA  method. 

Since u(n) is periodic, its autocorrelation function, defined as

$$R(i) = \lim_{L \to \infty} \frac{1}{2L+1} \sum_{n=-L}^{L} u(n)\,u(n+i) = \frac{1}{N} \sum_{n=0}^{N-1} u(n)\,u(n+i),$$

is also periodic with period N.
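
The coefficients of the all pole modeling filter, a_i, i = 1, ..., p, are obtained as the solution to the equation

$$\begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = - \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(p) \end{bmatrix}$$
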

Note that in contrast to the pitch-synchronous autocorrelation method, no windowing of the speech signal, s(n), was involved. Furthermore, the Toeplitz form of the equation above guarantees the stability of the all pole filter

$$V(z) = \frac{1}{1 + \sum_{i=1}^{p} a_i z^{-i}}.$$


Thus  the  PSA  method  combines  the  advantages  of  the 
covariance  method  (no  windowing)  and  the  autocorrelation 
method  (guaranteed  stability  of  V(z)). 
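
The computation for one pitch period can be sketched as follows. This is a minimal illustration under stated assumptions, not the program used in this study: the function names are hypothetical, and the Toeplitz equation is assumed to be solved by the Levinson-Durbin recursion, which is the usual choice for such systems but is not named in this appendix.

```python
def circular_autocorrelation(period, max_lag):
    """R(i) = (1/N) * sum_{n=0}^{N-1} u(n) u(n+i), for i = 0..max_lag.

    u(n) is the periodic extension of the pitch period, so the index
    n + i simply wraps around modulo N and no window is needed.
    """
    N = len(period)
    return [
        sum(period[n] * period[(n + i) % N] for n in range(N)) / N
        for i in range(max_lag + 1)
    ]


def levinson(R, p):
    """Solve sum_j a_j R(|i-j|) = -R(i), i = 1..p (Levinson-Durbin).

    Returns (a, ks, err): a[0] = 1 and a[1..p] are the coefficients of
    1 + sum_i a_i z^-i, ks are the reflection coefficients and err is
    the final prediction error energy.
    """
    a = [1.0] + [0.0] * p
    err = R[0]
    ks = []
    for m in range(1, p + 1):
        acc = R[m] + sum(a[j] * R[m - j] for j in range(1, m))
        k = -acc / err
        ks.append(k)
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a, ks, err


def psa_lpc(period, p):
    """All pole model of order p for one pitch period (list of samples)."""
    return levinson(circular_autocorrelation(period, p), p)
```

Calling `psa_lpc` once per pitch period gives the once-per-period model update described above. Reflection coefficients of magnitude less than one correspond to a stable V(z), which offers a simple check on each analysis frame.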


APPENDIX  E 


FILE  NAMING  CONVENTION  USED 


This  study  has  made  use  of  numerous  measures  of  vocal 
fold  vibration;  e.g.,  the  glottal  area,  the  glottal  length, 
the  EGG,  etc.  The  synchronized  data  base  is  very  likely  to 
be used in other studies of vocal fold vibration.
Consequently,  to  allow  easy  access  to  the  data  stored  in 
various  computer  files,  a  file  naming  convention  has  been 
adopted. 

Each task in the study has been allotted a unique 9 character task identification (i.d.) of the form XXXMMDDYT. Here XXX identifies the subject, MM is the month in which the data was collected, DD the day and Y the least significant digit of the year. T is a 1 letter code used to distinguish between different tasks of the same subject on the same day.

Each  "measure"  of  vocal  fold  vibration  (e.g.  the 
length,  glottal  volume  velocity,  etc.)  is  identified  by  a  1 
letter  prefix  to  the  task  i.d.  Thus  A  is  used  for  glottal 
area,  E  for  EGG,  S  for  the  speech  signal,  L  for  the  length 
of the glottal opening and V for the glottal volume velocity. A 2 letter file name extension provides more specific information about the data stored in the file.


Thus,  a  .UX  extension  refers  to  raw  data,  a  .CA  extension  to 
corrected  and  aligned  data,  a  .PR  extension  to  a 
differentiated  version  of  the  data  and  a  .FL  extension  to  a 
filtered  version  of  the  data. 

The file names are therefore all 12 characters long (a 1 letter prefix, the 9 character task i.d. and the 2 letter extension). Examples are:

EJMN08022B.UX - raw EGG data for subject JMN filmed on 8/2/82.

VAKK08192A.CA  -  corrected  and  aligned  glottal  v-v  for 
subject  AKK,  filmed  on  8/19/82. 
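
The convention is regular enough to be decoded mechanically. The sketch below is a hypothetical helper and not part of the software used in this study; it assumes that subject codes consist of three upper-case letters or digits, and the names `parse_file_name` and `DataFile` are introduced here for illustration only.

```python
import re
from typing import NamedTuple

# 1 letter prefix -> measure of vocal fold vibration
MEASURES = {
    "A": "glottal area",
    "E": "EGG",
    "S": "speech signal",
    "L": "length of the glottal opening",
    "V": "glottal volume velocity",
}

# 2 letter extension -> kind of data stored in the file
KINDS = {
    "UX": "raw data",
    "CA": "corrected and aligned data",
    "PR": "differentiated data",
    "FL": "filtered data",
}


class DataFile(NamedTuple):
    measure: str
    subject: str
    month: int
    day: int
    year_digit: int
    task: str
    kind: str


# Prefix, XXX, MM, DD, Y, T, then the extension.
# Assumption: subject codes are upper-case letters or digits.
_NAME = re.compile(
    r"([AELSV])([A-Z0-9]{3})(\d{2})(\d{2})(\d)([A-Z])\.(UX|CA|PR|FL)"
)


def parse_file_name(name):
    """Split a data file name into the fields of the naming convention."""
    m = _NAME.fullmatch(name)
    if m is None:
        raise ValueError("not a valid data file name: %r" % name)
    prefix, subject, mm, dd, y, task, ext = m.groups()
    return DataFile(MEASURES[prefix], subject, int(mm), int(dd),
                    int(y), task, KINDS[ext])
```

Applied to the two examples above, `parse_file_name("EJMN08022B.UX")` yields the EGG raw data of task B for subject JMN, and `parse_file_name("VAKK08192A.CA")` yields the corrected and aligned glottal volume velocity of task A for subject AKK.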